A study out of Wharton published earlier this year should be required reading for every technical leader who has deployed AI coding tools and called it a productivity win.
The researchers gave participants a series of questions, some factual, some analytical, and paired them with an AI assistant that was either correct or deliberately wrong. The finding that should stop you cold: when the AI gave a wrong answer confidently, participants accepted it 60% of the time. When the AI was correct, they accepted it 85% of the time.
In terms of participant accuracy, the gap between working with a right AI and working with a wrong AI was 44 percentage points.
What the research actually found
The Wharton team identified two distinct processing routes. The first they called the "deliberative route": the human reads the AI output, evaluates it against their own knowledge, and makes an independent judgment. The second they called the "autopilot route": the human sees fluent, confident output and accepts it without serious engagement.
The conditions that predict which route gets taken: time pressure, task complexity, and, most importantly, how confident the AI appears.
Fluency triggers trust. Confidence suppresses doubt. The brain's internal signal that would normally say "wait, check this" never fires, because nothing feels wrong.
They tried to fix it with incentives: pay per correct answer, plus real-time feedback after every item. The accuracy gap between correct AI and wrong AI: still 44 percentage points.
You cannot incentivize your way out of a broken cognitive architecture.
What it looks like in your engineering org
A developer opens Copilot. Gets a suggestion. It compiles. It passes the linter. It looks right. They commit it.
Did they think about it? For a second, maybe. But the AI was fluent and confident, and the sprint deadline is tomorrow, and the answer felt right.
That's the autopilot route in production. The AI's judgment became the engineer's output, with the engineer's name on the commit.
Now multiply that across ten engineers, forty sprints, and eighteen months of accumulated AI-assisted decisions.
That's your codebase. Not augmented. Not accelerated. Surrendered, one confident, frictionless commit at a time.
The scary part isn't the bad code. It's that nobody felt uncertain while they were writing it.
The org conditions that make it worse
Cognitive surrender doesn't happen equally everywhere. It clusters. And the conditions that produce it are almost always structural, not individual.
Time pressure is the accelerant. The Wharton data shows deliberation collapses under deadline pressure. If your sprints are structured so that engineers are making thirty AI-assisted decisions a day with no slack to question any of them, you have not built a productive team. You have built an autopilot factory.
Fluency is the trigger. Modern LLMs are extraordinarily good at producing output that sounds authoritative. The more confident and well-formed the suggestion, the less likely a human is to interrogate it. This is not a bug in the model. It is a feature that becomes a liability when the model is wrong.
Review culture is the last line of defense, and it's usually broken. Code review in most orgs has become a ceremony, not a check. Reviewers are looking for obvious errors, not evaluating whether the logic is actually sound. When AI-generated code looks clean and passes tests, it almost never gets the scrutiny it deserves. The review that would catch the problem doesn't happen, because nothing triggered the reviewer's doubt either.
The result is a compounding problem. Bad assumptions get reviewed by people who are also on autopilot, then approved, merged, and built upon. Eighteen months later, the codebase has a structural problem nobody can trace back to a single decision, because there wasn't one. There were thousands.
What a technical leader actually does about this
When I walk into an engineering org for the first time, I'm not looking at the roadmap. I'm not looking at the architecture diagram. I'm looking for the conditions that tell me whether the team is thinking or surrendering.
The first two weeks are diagnostic. Here's what I'm actually measuring:
Override rates. When engineers push back on AI suggestions, what happens? Is there a culture of "the tool is usually right, don't fight it," or do engineers feel safe saying "I don't trust this, let me think it through"? Override rate, roughly the share of AI suggestions that engineers reject or rewrite rather than accept as offered, isn't a metric most orgs track. It should be.
Review quality signals. I sit in on code reviews and watch for the ceremony problem. Are reviewers asking "why did you write it this way," or are they scanning for red lines and approving? The difference tells me whether the last line of defense is functioning.
Skeptic seat placement. Every team has engineers who naturally question AI output more than others. Where are they? Are they in senior roles where their skepticism shapes team behavior, or are they junior, outnumbered, and quietly ignored? The distribution of skeptics in your org is a leading indicator of how much cognitive surrender has taken hold.
Pressure architecture. How many AI-assisted decisions is an engineer expected to make in a sprint? What's the slack time for deliberation? If the answer is "there isn't any," the org has structurally eliminated the conditions under which deliberation can happen. No amount of training fixes that.
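For teams that want to put rough numbers on two of these signals, override rate and decision load, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `SuggestionEvent` record, the three outcome labels, and the idea that your tooling can export one event per AI suggestion at all. Treat it as a starting point for a conversation, not a finished metric.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date

# Assumed outcome labels; real tools may expose different signals, or none.
OVERRIDE_OUTCOMES = {"edited", "rejected"}


@dataclass(frozen=True)
class SuggestionEvent:
    """One AI suggestion shown to an engineer (hypothetical log record)."""
    engineer: str
    day: date
    outcome: str  # "accepted", "edited", or "rejected"


def override_rate(events):
    """Share of suggestions that were edited or rejected; None if no events."""
    events = list(events)
    if not events:
        return None
    overridden = sum(1 for e in events if e.outcome in OVERRIDE_OUTCOMES)
    return overridden / len(events)


def override_rate_by_engineer(events):
    """Override rate per engineer, to see where skepticism is concentrated."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e.engineer].append(e)
    return {name: override_rate(evts) for name, evts in grouped.items()}


def mean_daily_decisions(events):
    """Average AI-assisted decisions per engineer per active day."""
    per_engineer_day = Counter((e.engineer, e.day) for e in events)
    if not per_engineer_day:
        return None
    return sum(per_engineer_day.values()) / len(per_engineer_day)


# Made-up sample data, purely to show the call pattern.
sample_events = [
    SuggestionEvent("ana", date(2024, 1, 8), "accepted"),
    SuggestionEvent("ana", date(2024, 1, 8), "accepted"),
    SuggestionEvent("ana", date(2024, 1, 8), "rejected"),
    SuggestionEvent("ana", date(2024, 1, 9), "edited"),
    SuggestionEvent("ben", date(2024, 1, 8), "accepted"),
    SuggestionEvent("ben", date(2024, 1, 8), "accepted"),
]

if __name__ == "__main__":
    print("Team override rate:", override_rate(sample_events))
    print("By engineer:", override_rate_by_engineer(sample_events))
    print("Mean daily decisions:", mean_daily_decisions(sample_events))
```

A low override rate does not prove surrender on its own, since the tool may simply be right. The number is useful as a prompt: an engineer who accepts everything on the days with the heaviest decision load is worth a conversation, not a verdict.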
The diagnosis usually takes two weeks. What I find almost always confirms the same pattern: the tools were deployed, the velocity metrics went up, and nobody asked what was happening to judgment quality in the process.
The uncomfortable conclusion
The question isn't whether your team is using AI. They are.
The question is whether they're using it, or whether it's using them.
The Wharton findings suggest that in most orgs, under most conditions, the autopilot route is the default. Fluent, confident AI output suppresses deliberation. Time pressure eliminates the slack that deliberation requires. Review culture provides the appearance of a check without the substance of one.
The engineers aren't lazy. They're not careless. They're operating exactly as humans operate under the conditions you built for them.
Fixing it isn't a training problem. It's a judgment infrastructure problem. It requires deliberate decisions about how AI tools are deployed, what review actually means, where skeptics sit, and how much pressure the system puts on the humans inside it.
The codebase you have right now is a record of the conditions you built. The codebase you'll have in eighteen months will be a record of the conditions you build next.