A Seat at the Table, Not a Verdict - "Confident and Wrong" Part Two

You've drafted a survey. Before it goes out to the public, you want to know where it'll trip people up: the question that reads clearly to you but lands oddly for someone encountering it cold, the response option you forgot to include, the wording that quietly nudges people toward an answer. Normally you'd find this out by testing the draft with real respondents. But testing costs time and money you don't always have, so you ask an AI to play the respondent instead — adopt a persona, work through the survey as that person would, and flag whatever causes confusion.

It works, up to a point. The trouble is that the point where it stops working doesn't announce itself.

This is the second of three posts ("Confident and Wrong") adapted from a talk I gave at the joint Canadian Conference on AI, Robots & Vision (AI/CRV) held at Simon Fraser University in May 2026. The first post looked at AI-assisted analysis of survey data after the fact. This one moves to the other end of the process: using AI to stress-test a survey before it ever goes live.

What we ask the personas to do

Before a government survey goes live, it should be tested with real respondents — cognitive interviews, pilot testing, people who actually represent who you're trying to reach. Under tight budgets and compressed timelines, that's often not fully possible. So we've started using AI personas as a complement: a final pass through the drafted survey from a few simulated perspectives before it's programmed and fielded.

We build the personas in advance, grounded in what we know about the people we're trying to reach — a skeptical long-term resident, say, or a newcomer with limited English. Then we ask the AI to work through the survey as that person and flag anything that would make them hesitate, guess, or give an answer that doesn't reflect what they actually think.

What it's good for is fresh eyes at low cost. A persona will surface things I've stopped seeing because I wrote them: an ambiguous phrase, a missing response option, a question that reads fine alone but lands oddly third in a sequence. We don't act on every flag. But it lets us tell the program area we considered an issue and made a deliberate choice, rather than having it slip past us unnoticed.

The limits are structural. They don't go away with a better prompt or a more carefully built persona, and two of them show up reliably enough that they're worth naming.

The persona doesn't stay in character

The first is persona drift. The skeptical respondent starts out answering like a skeptical respondent. Then, several questions in, it starts answering like a helpful one.

The AI tends to flatten toward the kind of response it seems to think you want — agreeable, accommodating, easy. For most tasks that tendency is harmless or even useful. For a stress-test it's close to the opposite of what you need, because the whole point is to simulate someone who won't cooperate smoothly with a poorly worded question.

Once you know this is a feature of the tool rather than a glitch you can prompt away, it changes how you use it. You weigh the early flags more heavily than the late ones. You re-anchor the persona partway through. And you treat the whole output as a set of hypotheses to check, not findings to act on.
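To make the re-anchoring and the uneven weighting concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any real tooling: the names `build_prompts` and `drift_weight`, the choice to restate the persona every three questions, and the simple decay formula are all hypothetical. The weight in particular is an arbitrary heuristic for ordering your reading, not a measured property of any model.

```python
def build_prompts(persona: str, questions: list[str], reanchor_every: int = 3) -> list[str]:
    """Build one prompt per survey question, restating the persona at intervals.

    `reanchor_every` is an assumed setting: the persona description is
    repeated on the first question and then every N questions after it.
    """
    prompts = []
    for i, question in enumerate(questions):
        body = f"Question {i + 1}: {question}"
        if i % reanchor_every == 0:
            # Re-anchor: remind the model who it is supposed to be.
            body = f"Stay in character. {persona}\n\n{body}"
        prompts.append(body)
    return prompts


def drift_weight(question_index: int, reanchor_every: int = 3) -> float:
    """Illustrative weight for a flag raised at a given question (0-based).

    Flags raised right after a re-anchor get weight 1.0; the weight shrinks
    with each question since the persona was last restated.
    """
    questions_since_anchor = question_index % reanchor_every
    return 1.0 / (1 + questions_since_anchor)
```

Calling `build_prompts(persona_text, survey_questions)` gives you the sequence to send one at a time, and `drift_weight` gives you a rough ordering for which flags to read first. Neither replaces the last habit in the list: whatever the weight, a flag is still a hypothesis to check.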

Watch for this: The most useful flags often come in the first few questions, before the persona drifts toward agreeableness. If you're reading a persona transcript looking for the strongest signal, don't assume it's evenly distributed. The tool may be most "in character" exactly when you're still warming up to the exercise.

What it can't feel, and what it makes up

The second limit comes in two related forms.

The first form is emotional register. An AI persona can approximate what a respondent might not know, such as an unfamiliar term or a concept outside their experience. What it can't do is simulate how a real person feels about not knowing it. The quiet frustration of someone who feels talked down to doesn't show up as a flag. In a real survey it shows up as a dropout, where the person closes the tab. The AI has no equivalent, so the problem that matters most in practice is the one it's least able to surface.

The second form is what I'd call ecologically invalid feedback: concerns that are plausible in the abstract but wrong in context. In one case a persona flagged a question as potentially exclusionary. Reasonable on its face. But when we brought it to the program area, the people who knew the policy and the client population recognized immediately that the concern didn't apply to the population this survey was actually reaching. Acting on it would have made the instrument worse, not better. The flag was coherent, and it was also wrong, in a way nothing about the output revealed.

Both limits point in the same direction: the output has to be checked against knowledge the AI doesn't have and can't get on its own.

The verification layer

What we built in response isn't a fix; there's no prompt that resolves any of this. It's a verification step.

Every issue a persona flags goes to the program area before anyone acts on it. Not because we assume the AI is wrong, but because the question that actually matters — would real respondents in this context experience the question this way? — isn't one the AI can answer reliably on its own. Answering it takes domain knowledge: the policy context, the public expectations, the history of how this group has responded before. The AI might approximate that knowledge, but we can't verify the approximation without going to the people who actually hold it.

The personas do the surfacing. The judgment call stays with us.

Try this with any AI persona output: Before you change anything based on a flag, ask one question — does this concern reflect how real people in this specific context would respond, or only how a plausible-sounding general respondent might? Then take it to someone who knows the population. If you can't name who that someone is, you're not ready to act on the flag yet.
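If you track flags in a script or a spreadsheet export, the same rule can be written down as a gate. The Python sketch below is one hypothetical way to do it: the `PersonaFlag` fields, the bucket names, and the `triage` function are assumptions made for illustration. The only thing it encodes is the rule above: no named reviewer and no verdict means no action.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonaFlag:
    question_id: str
    concern: str
    # Person in the program area who checked the flag; None if nobody has.
    reviewer: Optional[str] = None
    # Reviewer's verdict on whether the concern applies to the real
    # population; None means it has not been reviewed yet.
    applies_in_context: Optional[bool] = None


def ready_to_act(flag: PersonaFlag) -> bool:
    """A flag is actionable only with a named reviewer who confirmed it."""
    return bool(flag.reviewer) and flag.applies_in_context is True


def triage(flags: list[PersonaFlag]) -> dict[str, list[PersonaFlag]]:
    """Sort flags into three buckets without ever acting on an unreviewed one."""
    buckets: dict[str, list[PersonaFlag]] = {
        "act": [],
        "documented_no_change": [],
        "needs_review": [],
    }
    for flag in flags:
        if not flag.reviewer or flag.applies_in_context is None:
            buckets["needs_review"].append(flag)
        elif flag.applies_in_context:
            buckets["act"].append(flag)
        else:
            # Reviewed and judged not to apply: keep the record, change nothing.
            buckets["documented_no_change"].append(flag)
    return buckets
```

The `documented_no_change` bucket matters as much as `act`. It is the record that an issue was considered and deliberately set aside, rather than missed.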

Same mechanism, different seat

This is the same pattern as the first post, viewed from the other end of the work. There, AI relocated the analytic judgment to after the output, against a clean summary that hid its own seams. Here, it relocates the judgment to after the flag, against feedback that sounds reasonable whether or not it holds. In both cases the thinking didn't disappear when the tool arrived. It moved to a place that's easier to overlook, and under time pressure, easier to skip.

So far this has been about what responsible use looks like on one small team. In the final post I'll pull back to the bigger question: why the guidance, policy, and training built to support this kind of work mostly stop short of the part that actually matters.

BC public servants can read the province's guidance on reviewing generative AI outputs, a useful starting point that I'll return to in Post 3.