When the Output Looks Clean - "Confident & Wrong" Part One

A thematic summary lands on your desk. Three hundred and forty-two people answered an open-ended survey question, and an AI tool has sorted their responses into five clean themes, each with a count and a percentage. Twenty-five percent supported the changes. Twenty-two percent had concerns about implementation. It's tidy and fast, and it looks finished. You could drop it into a briefing note this afternoon.

Here's the problem: a wrong answer and a right answer look almost identical coming out of an AI tool. The output itself gives you little purchase on which one you're holding. That table could be an accurate read of what people said, or it could be quietly off, and nothing on the page tells you which.

The gap between how an output looks and what it actually contains is what this series is about.

This is the first of three posts in the "Confident and Wrong" series, about a gap I've come to think is the most important and least discussed problem in how large organizations, such as governments, are adopting AI. It's adapted from a talk I gave at the joint Canadian Conference on AI, Robots & Vision (AI/CRV) held at Simon Fraser University in May 2026.

My team makes sense of public engagement data — the feedback people share in surveys, focus groups, and facilitated sessions. Over the past few years I've watched my corner of government get good at acquiring AI tools, writing policies about them, and tracking whether people use them. What I've seen far less of is anyone asking a harder question: how do you know when one of these tools is working?

Here's what that question looks like in practice, using recent work from my team.

What the AI did, and what it didn't tell us

We use AI to assist with thematic analysis. You take hundreds of written responses, and the tool does a first-pass sort: it reads them, groups them by theme, and hands back a structure you can work from. It's useful, and far faster than doing it by hand.

But when an AI sorts open-ended responses into themes, it's constantly making granularity decisions — judgments about how finely to cut the categories. Should "housing affordability" and "rental costs" be one theme or two? Should "lack of consultation" and "feeling unheard" be merged, or kept apart? These are real analytic decisions, and they shape what the data appears to say.

The AI makes every one of them without flagging that a judgment was made. You get named categories, clean counts, and a structure that looks ready to use. Nothing signals that a dozen consequential calls are baked into it — calls that a different analyst, or even the same tool in a different session, might have made differently.

The trouble isn't that the AI is wrong. It may well have made reasonable choices. The trouble is that the output gives you no visibility into what those choices were. So you can't tell whether the structure in front of you reflects what your data actually contains, or just the level of distinction that happened to make sense to the particular model.
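One rough way to surface those hidden calls is to run the same sort twice and compare how the two runs group the responses. The sketch below is an illustration, not the method my team uses: it assumes each run can be exported as a mapping from response ID to theme label, and the function name and sample data are hypothetical. It computes a simple pairwise agreement score (the Rand index), which compares groupings rather than labels, so a theme that was merged or split in one run still registers as a difference.

```python
from itertools import combinations


def pairwise_agreement(run_a, run_b):
    """Share of response pairs that both runs treat the same way (Rand index).

    A pair counts as agreement when both runs put the two responses in the
    same theme, or both runs put them in different themes. Theme names are
    never compared across runs, only the groupings.
    """
    if run_a.keys() != run_b.keys():
        raise ValueError("both runs must code the same response IDs")
    agree = 0
    total = 0
    for x, y in combinations(sorted(run_a), 2):
        same_in_a = run_a[x] == run_a[y]
        same_in_b = run_b[x] == run_b[y]
        agree += same_in_a == same_in_b
        total += 1
    # With fewer than two responses there is nothing to disagree about.
    return agree / total if total else 1.0


# Hypothetical data: run 1 cut the themes finely, run 2 merged them.
run_1 = {
    "r1": "housing affordability",
    "r2": "rental costs",
    "r3": "lack of consultation",
    "r4": "feeling unheard",
}
run_2 = {
    "r1": "housing",
    "r2": "housing",
    "r3": "not consulted",
    "r4": "not consulted",
}

score = pairwise_agreement(run_1, run_2)
print(round(score, 2))  # 4 of the 6 pairs agree
```

A low score doesn't say which run is right. It only shows that a granularity decision was made, which is the information the clean table leaves out.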

Watch for this: The danger isn't a dramatic error. It's the quiet one. A frustrated, sarcastic response coded as "support" because the surface language sounded positive. Two distinct concerns collapsed into one theme because they kept showing up together. Neither announces itself in the output. And in government, that output often becomes a report, which shapes how findings get characterized, which informs a decision — and the gap never surfaces at any of those stages.

What it took to catch it

The answer wasn't a better prompt. It was a step we now take before the AI touches any responses.

The analyst writes down the specific questions the analysis has to answer: not "what did people say?" but "what decisions will this inform?" Those questions become the standard. We call them, simply, "Questions to be Answered" (QBAs). When the AI output comes back, we're not asking whether it looks reasonable. We're asking whether it answers the questions we committed to in advance.

That distinction does the work. Whether something looks reasonable can be satisfied by anything fluent and well-organized, which is what these tools produce by default. Whether it answers your pre-specified question can only be settled by going back to the source text and checking. The demanding version of the question forces the evaluation the clean output is quietly inviting you to skip.

There's a methodological idea underneath this that I find useful. In survey work, a question only earns its place if you can say what you'd do differently depending on the answer. The same test applies to an AI output: before you use it to inform a decision, you should be able to say what you'd do differently depending on what it shows. If you can't, the tool hasn't earned its place — and you have no standard to judge its output against.

Try this before your next AI-assisted analysis: Write down, in advance, the one decision this output is supposed to inform, and what a wrong answer would look like. Not "what will the AI tell me," but "what would I expect to see if this output were misleading, and how would I check?" If you can't answer that before you start, you won't be able to evaluate what comes back. The clean output won't give you the standard. You have to bring it.
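If "how would I check?" means going back to the source text, one simple option is to pull a small random sample from every theme and read those responses against the label they were given. The sketch below is a minimal illustration under assumptions of my own: the coded output is available as a mapping from response ID to theme, and the function name, sample size, and data are hypothetical. A fixed seed makes the draw repeatable, so a colleague can review the same responses.

```python
import random
from collections import defaultdict


def spot_check_sample(coded, per_theme=5, seed=0):
    """Pick up to `per_theme` response IDs from each theme for manual reading."""
    by_theme = defaultdict(list)
    for response_id, theme in coded.items():
        by_theme[theme].append(response_id)

    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    sample = {}
    for theme in sorted(by_theme):
        ids = sorted(by_theme[theme])
        # Small themes are read in full rather than sampled.
        sample[theme] = rng.sample(ids, min(per_theme, len(ids)))
    return sample


# Hypothetical coded output: 30 responses sorted into two themes.
coded = {f"r{i}": ("concerns" if i % 3 == 0 else "support") for i in range(1, 31)}

to_read = spot_check_sample(coded, per_theme=5)
for theme, ids in to_read.items():
    print(theme, ids)
```

Sampling within each theme, rather than across the whole dataset, means small themes get read too. Those are often where a miscoded response changes the picture most.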

The thing that doesn't go away

Using AI didn't remove the analytic thinking from this work. It moved it. The judgment that used to happen while building the themes by hand now has to happen afterward, against an output that looks finished and shows no seams. The thinking didn't disappear. It relocated — and that relocation makes it easier to skip.

That's the pattern I want to trace through this series. In the next post I'll show it from a different seat: not analyzing data after the fact, but using AI to stress-test a survey before it ever goes out. Same mechanism, a stranger set of limits.

BC public servants can read the province's guidance on reviewing generative AI outputs. It's a useful starting point — and in Post 3 I'll come back to where starting points run out.