Why the Checklist Isn't Enough - "Confident and Wrong" Part Three
A piece of guidance crosses your desk. It tells you, as the person using an AI tool, to check the output for accuracy and bias before you rely on it. It even gives you questions to ask: are some perspectives portrayed more favourably than others? Are any important viewpoints missing? Does the wording lean on assumptions about a group? These are good questions. They reflect real awareness of how these tools go wrong. Then comes the line that does the work: "... if the output feels biased, do not use the content."
Read that again. "If it feels biased." That's where the guidance hands the hardest part back to you and walks away.
This is the last of three posts ("Confident and Wrong") adapted from a talk I gave at the joint Canadian Conference on AI, Robots & Vision (AI/CRV) held at Simon Fraser University in May 2026. The first two looked at what careful AI use took on my own team — analyzing survey responses after the fact (Post 1), and stress-testing a survey before it went live (Post 2). This one steps back to the system those examples sit inside.
In both of the earlier examples, the same thing happened. The AI produced something that looked finished and gave no sign of where it might be wrong, and catching the problem took an evaluative standard the output itself couldn't supply. My team had to build that standard ourselves: pre-specified questions before the analysis began, verification after every output, a second analyst reading independently, and consultation with the program area before acting on anything a persona flagged. None of it shows up in an adoption metric. All of it is what using these tools responsibly actually required.
So when I read the guidance, the policy, and the training that's supposed to support this kind of work, I'm reading it against what the work turned out to demand. And the same gap keeps appearing.
Three layers, one gap
The gap reproduces itself at every level I've looked at.
The principles name the right values: transparency, accountability, reliability, human oversight. They're sound. What they don't say is what enacting any of them requires of the person doing it. "Proactive human oversight" appears as a value, described as monitoring outputs for accuracy. But monitoring for accuracy against what? What's the independent standard, and how would you know if your monitoring caught anything?
The operational guidance goes one level down and does the same thing in a sharper way. It tells employees they must review outputs to make sure they're accurate, complete, and current. That's a requirement, stated as though the capacity to meet it is already there. It tells you that you must evaluate without telling you how, or what it takes to do it well. The "feels biased" line is this exact move in miniature: a real problem, named correctly, handed back to the reader as an instinct to follow rather than a method to apply.
The training, in my experience of what's been available to me, builds enthusiasm for adoption — how to prompt, how to find use cases, how to get comfortable. I've completed internal and external offerings, basic and advanced. In none of them was there a serious attempt to assess whether participants could actually evaluate an AI output, as opposed to produce one. Completion was the measure. And this reflects an infrastructure based on compliance, not on competency.
Each layer assumes the layer below it has handled the part it skipped. It's one gap, reproduced three times.
There's a difference between guidance that uses operational-sounding language and guidance that's actually operational. "Review the output critically" sounds like an instruction. But if it doesn't tell you what to compare the output against, or how you'd know your review worked, it's naming a value, not building a capacity. The tell is whether you could hand it to two people and expect them to do the same thing. Most of what I've read, you couldn't.
Why the gap holds
It would be easy to read this as carelessness, or as individual documents falling short. It isn't, and that framing misses what's actually going on.
In the public strategies and guidance I've reviewed, at both the provincial and federal level, success is measured mostly by adoption: whether tools are procured, deployed, and used. There's nothing wrong with that as far as it goes. Procurement matters, and you have to acquire a tool before anyone can use it. But acquiring a tool and building the capacity to evaluate it are different investments, and only one of them is being explicitly made. Adoption is counted. Evaluative capacity isn't measured, isn't funded as its own line, and in practice isn't anyone's defined job.
Once you see it as a question of what gets measured, the persistence makes sense. Organizations build the capacity they track. Adoption is legible: it's easy to put a number on a deployment, a usage rate, a count of trained staff. Evaluative capacity is harder to see and harder to score, so it tends not to get built, not because anyone decided it didn't matter but because nothing in the system is set up to notice its absence. And the absence is quiet. It doesn't announce itself, any more than the miscoded response or the drifting persona did.
What I'd want guidance to grow into
I don't think the publicly available guidance documents are wrong. The principles are right, the review questions are reasonable, and somewhere in the provincial guidance there's even a genuinely good piece of methodological reasoning about not asking AI to summarize survey responses when you can ask it to show you the responses and summarize them yourself. The instinct is there. What's missing is the follow-through — the hard, unglamorous part where you specify what evaluating an output actually involves and treat that as a skill to be built and maintained, not a box to be checked once.
That's the part my team had to work out on our own. I'd like to see it become part of what guidance like this offers, rather than the part it leaves to whoever happens to think of it.
The most important AI question for government isn't which tools to adopt. It's whether we've built the capacity to evaluate them — to know, independent of how an output looks, whether it's working. That's a methodology problem before it's a technology problem. And in the strategies and guidance I've read, the question of how you'd actually know doesn't have an answer. Which means, in practice, it isn't yet anyone's job. Until that changes, having the checklist will keep getting mistaken for having the capacity.
Three questions worth sitting with
When I want to think clearly about this, I don't reach for the newest books about AI. The questions that matter are older than the current conversation, and three books that aren't about AI at all have shaped how I read all of this.
The first question is: whose standards of evidence did this model learn? Catherine D'Ignazio and Lauren Klein's Data Feminism (2020) is good on the way a clean output carries an authority it hasn't earned, and on how data that looks neutral always reflects choices someone made about what counts.
The second is: who bears the cost when an AI output is wrong, and would they know? Cathy O'Neil's Weapons of Math Destruction (2016) sits with the people on the receiving end of automated decisions, the ones least able to see the error and least equipped to contest it.
The third is: does our guidance support expert judgment, or substitute for it? Atul Gawande's The Checklist Manifesto (2009) is the one that looks like it argues against me, and doesn't. Gawande's checklists work because they're built by experts to support judgment under pressure, not to replace the expertise that makes them useful. A checklist handed to someone without the underlying competency isn't the same tool. That is the argument of this series, in one book.
None of these are about artificial intelligence. That's the point. The work of deciding whether to trust what a tool tells you is old work, and we already know more about how to do it than the current conversation lets on.
BC public servants can read the province's guidance on reviewing generative AI outputs. Read it as a starting point, and notice where it asks you to bring a standard it doesn't give you.