AI Companies Will Trick You: What the Claude Code Degradation Tells Us About Trust

In February and March 2026, thousands of paying Claude Code users reported the same thing: the model had quietly gotten dumber. Premature stops on long tasks. More reasoning loops. A shift from “research first” to “just edit something.” A senior director in AMD’s AI group sat down and analysed 6,852 session files to confirm what the GitHub thread (1,060 upvotes and counting) had been saying — Claude’s reasoning depth had measurably dropped starting in early February.

A new term entered the discourse: “AI shrinkflation” — paying the same price for a weaker product. Anthropic eventually acknowledged the issue, identified three specific technical causes, and shipped fixes by mid-April. They were careful to add: “We never intentionally degrade our models.”

That qualifier matters. So does the timing — Anthropic posted the public mea culpa the same day OpenAI shipped GPT-5.5.

This article is not about whether AI companies are literally sandbagging models on purpose. It is about why the suspicion is rational, what we can verify happened, and why reviewing every diff an AI tool produces is not a “for now” practice but the permanent posture.

What we know actually happened

Anthropic’s own post-mortem, after weeks of user backlash, identified three concrete causes of the Claude Code regression in February–April 2026:

1. The reasoning-effort downgrade (March 4)

To address complaints that the UI looked “frozen” while the model thought, Anthropic changed the default reasoning effort for Claude Code from high to medium. The intention was to make the interface feel responsive. The unintended consequence was that complex coding tasks visibly lost intelligence — the model had less budget to think before answering, so it answered faster and worse.

This is a fair tradeoff to make in some products. The problem is that paying customers were not told the tradeoff was being made, and most of them noticed the quality drop before any UI improvement.

2. A session-clearing bug

A separate bug caused conversation context to be cleared mid-session in some flows, producing the “Claude forgot what we were doing” effect that drove a lot of the early-March complaints. Pure regression — not a tradeoff.

3. A system-prompt change to reduce verbosity

Anthropic adjusted the system prompt to make Claude Code less verbose. Less verbose is what users had been asking for. But the same change reduced quality on coding tasks where verbose intermediate reasoning is what produces the right answer. Another well-intentioned change with a regression as a side effect.

The combination is what produced the months-long quality cliff. None of the three is a “secret nerf to push you toward the next subscription tier.” Two are mundane engineering decisions and one is an ordinary bug. Together they compounded into a worse product, and none of them was reverted or fixed until the user revolt forced a public accounting.

Why the suspicion is rational anyway

“We never intentionally degrade our models” is true and beside the point. Even without intent, the structural incentives push in one direction:

Compute is finite, premium tiers are a release valve

Heavy use of a frontier model can cost more to serve than a flat chat subscription brings in. When usage grows faster than capacity, the cheapest lever is to quietly reduce per-request compute (fewer reasoning tokens, smaller context, faster sampling) and ship a higher tier for users who need full power. This is not a conspiracy theory; it is just where the cost curve points. If the only public communication is “we are not intentionally degrading,” users are right to be skeptical of what unintentional looks like over a long period.

Benchmark numbers don’t track perceived quality

A model that scores identically on MMLU, HumanEval, and SWE-bench can still feel meaningfully worse to a daily power user. Benchmarks measure capability ceilings on bounded tasks. Real-world coding sessions are open-ended, multi-turn, and highly sensitive to small changes in prompt handling, context retention, and reasoning budget — exactly the dimensions that shift first when compute is the constraint.

Release-cadence incentives are real

It is also true that Anthropic’s public acknowledgement of Claude Code’s quality issues landed on the same day GPT-5.5 launched. That timing might be coincidence. It might also be how a company chooses which day is least painful for an admission to land. Either reading is plausible — and that is the point. When the reasonable interpretation requires assuming good faith on every product decision, you don’t have a basis for trust; you have a hope.

What “review the code” actually means in 2026

The practical takeaway is not that you should be paranoid. It is that the level of review AI-generated code needs is not going down as the models get better. It might be going up — because the failure modes are getting more subtle.

A few specific habits that pay off:

Read every diff, not every line

You don’t have to review every character the model writes. You do have to read every diff it produces, including the diffs in files you didn’t ask it to touch. A common production-incident pattern is that the model also edited a related file and nobody looked at it.
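One way to make the "files you didn't ask it to touch" check mechanical is to compare the list of changed files against the files you expected to change. The following is a minimal sketch, assuming the work happens in a git repository. The function names are hypothetical; the only external command used is `git diff --name-only`.

```python
import subprocess


def changed_files(base: str = "HEAD") -> list[str]:
    """Return tracked paths that differ from `base` in the working tree."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a repository
    )
    return [line for line in result.stdout.splitlines() if line]


def unexpected_changes(changed: list[str], expected: set[str]) -> list[str]:
    """Return changed paths that were not among the files you asked for."""
    return sorted(path for path in changed if path not in expected)
```

Calling `unexpected_changes(changed_files(), {"src/billing.py"})` (the path is illustrative) would list every other tracked file the session modified. Note that `git diff` does not report untracked files, so newly created files need a separate look, for example with `git status`.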

Watch for plausible-but-wrong

The dangerous regression mode is not “obviously broken code,” which fails fast in CI. It is plausible code that compiles, passes happy-path tests, and silently mishandles an edge case. The usual hiding places are error handling, off-by-one cases, async ordering, and security-relevant input validation. These are exactly the categories where a model with a reduced reasoning budget will output something fluent that doesn’t actually work.
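To make "plausible-but-wrong" concrete, here is an illustrative off-by-one case. Both functions are hypothetical examples written for this article, not output captured from any model. The first reads naturally and passes a happy-path test; the second handles the partial last page.

```python
def page_count_plausible(total_items: int, page_size: int) -> int:
    """Looks right and passes the happy path, but drops a partial last page."""
    return total_items // page_size


def page_count(total_items: int, page_size: int) -> int:
    """Ceiling division: a partial last page still counts as a page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    return -(-total_items // page_size)


# Happy path: both versions agree, so a success-only test passes either one.
assert page_count_plausible(100, 10) == page_count(100, 10) == 10

# Edge case: 101 items need 11 pages; the plausible version reports 10.
assert page_count_plausible(101, 10) == 10
assert page_count(101, 10) == 11
```

A test suite containing only the first assertion would accept either implementation, which is why the edge case has to be tested explicitly.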

Test for the failure case, not just the success case

If you ask the model to write a function that handles X, the easy mistake is testing only that it handles X. Add tests that confirm it rejects malformed inputs, fails loudly on impossible states, and doesn’t silently swallow errors. Code from a degraded model can still pass the success tests; the failure tests are what catch it.
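A sketch of what failure-case tests can look like, using Python's standard `unittest` module. The `parse_port` function is a hypothetical stand-in for whatever the model was asked to write; the point is the second test, which checks that bad input is rejected rather than quietly converted into a default.

```python
import unittest


def parse_port(value: str) -> int:
    """Parse a TCP port, raising ValueError instead of returning a default."""
    try:
        port = int(value.strip())
    except ValueError:
        raise ValueError(f"not a valid port: {value!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTests(unittest.TestCase):
    def test_accepts_valid_port(self):
        # The success case: necessary, but not sufficient.
        self.assertEqual(parse_port("8080"), 8080)

    def test_rejects_malformed_input(self):
        # The failure cases: each of these must raise, not return a value.
        for bad in ["", "abc", "80.5", "0", "65536", "-1"]:
            with self.subTest(value=bad):
                with self.assertRaises(ValueError):
                    parse_port(bad)
```

Saved to a file, this can be run with `python -m unittest`. An implementation that wraps everything in a broad `except` and returns a fallback port would pass the first test and fail the second.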

Run your own evals

A small, internal evaluation set tied to the kind of work you actually do — refactors of files in your codebase, bug fixes against your test suite, agent loops for your typical workflow — is the only reliable way to know whether your AI tools have gotten quietly worse. Public benchmarks won’t tell you. Vendor announcements will tell you what they want you to hear. Your own evals will tell you what’s true for your work.
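An internal eval does not need much machinery. The sketch below is one possible shape, under stated assumptions: `EvalCase`, `pass_rate`, `regressed`, and `fake_tool` are all hypothetical names, and `fake_tool` stands in for however you actually invoke your AI tool. The idea is to record a baseline pass rate while the tool is working well, then rerun the same cases later and compare.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def pass_rate(run_tool: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the tool and return the fraction that pass."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = 0
    for case in cases:
        try:
            ok = case.check(run_tool(case.prompt))
        except Exception:
            ok = False  # a crash counts as a failure, not a skipped case
        passed += ok
    return passed / len(cases)


def regressed(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a pass rate more than `tolerance` below the recorded baseline."""
    return current < baseline - tolerance


# Stand-in for a real call to your AI tool (hypothetical).
def fake_tool(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"


cases = [
    EvalCase("defines add", "Write add(a, b)", lambda out: "def add" in out),
    EvalCase("no bare except", "Write add(a, b)", lambda out: "except:" not in out),
]
rate = pass_rate(fake_tool, cases)
```

In practice the checks would be stronger than substring matches, for example running your own test suite against the generated change. The tolerance is there because model output varies from run to run, and a small eval set is noisy.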

The actual lesson

AI companies will not “trick you” in the cartoon sense — there is no boardroom decision to ship a worse product before a release event. What they will do, structurally, is:

Make tradeoffs you weren’t told about
Define quality by the metrics they choose to publish
Acknowledge regressions only after public pressure forces it
Time the acknowledgement around the news cycle

That is not malice; it is what every commercial product does. The mistake is treating AI tools as if they were exempt from this — as if the model that produced beautiful code in November will produce equally beautiful code in March because the brand on the URL is the same.

The model is software. Software changes. The reasoning budget allocated to your request is a number in a config file somewhere. Trust the output, not the brand. Run your own evals, review every diff, and write tests that punish silent failure modes. When something feels off, take that seriously enough to investigate, even if the vendor is still telling you the model is the same.

In April 2026, thousands of users were right and the vendor was, technically, also right. Both can be true. The lesson is to not need either one to be true for your code to be safe.

AI companies will trick you.