The silent AI bug: code I shipped because it "worked" but was wrong

May 24, 2026 · 6 min read · Singrey

Code written with AI has a new kind of bug: it compiles, passes tests, looks right, but behaves wrong. In mid-2026 this is the sneakiest enemy of the solo developer.

Last week I was making a small change in Cubitz. On the first try, the AI wrote a function that looked perfectly clean. Lint passed, tests passed, and a manual check looked fine. Two days later, after a user report, I looked again: the code was wrong. It only returned the right answer for scenarios that matched the ones I had already checked.

This is a new kind of bug. Not the old "AI is making things up" class. Sneakier: AI writes code that looks consistent, looks logical, looks right — but is wrong. I'm calling it the silent bug.

The old bug: noisy and visible

In 2023-24 the classic AI-written bug was noisy: wrong API use, made-up library, syntax that wouldn't run. The bug screamed the moment it happened — compiler error, test failure, crash on run. Easy to catch.

Those still exist but the rate dropped sharply. As models got better, syntax-level bugs faded and library hallucinations almost vanished.

The new bug: silent and reasonable

The new bug's signature is different:

• The code compiles — no syntax errors.

• The code passes tests — because AI wrote tests against its own code.

• The code looks logical — read it and you say "yes, that's right."

• But the code returns wrong answers — only in some edge cases, sometimes in production, sometimes for a specific user type.

The root cause in my Cubitz case: AI wrote a date comparison function. Test data was in UTC and tests passed. In production, the user's browser was in a different timezone and the result was off by one day. I read what the AI did, it made sense, I accepted it. It was wrong.
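Here is a minimal sketch of how this class of bug can look. It is written in Python for illustration and is not the actual Cubitz code; the function names, the timestamps, and the fixed UTC+3 offset are all assumptions. Real browser time zones also have daylight-saving rules, which a fixed offset ignores.

```python
from datetime import datetime, timedelta, timezone


def same_day_utc(a: datetime, b: datetime) -> bool:
    """Hypothetical 'looks right' version: compares calendar dates in UTC only."""
    return a.astimezone(timezone.utc).date() == b.astimezone(timezone.utc).date()


def same_day_for_user(a: datetime, b: datetime, user_tz: timezone) -> bool:
    """Compares calendar dates as the user sees them, in the user's own offset."""
    return a.astimezone(user_tz).date() == b.astimezone(user_tz).date()


# UTC-only fixture: the kind of test data that lets the tests pass.
utc_morning = datetime(2026, 5, 20, 9, 0, tzinfo=timezone.utc)
utc_evening = datetime(2026, 5, 20, 17, 0, tzinfo=timezone.utc)

# Assumed user at UTC+3 (a fixed offset, chosen only for illustration).
user_tz = timezone(timedelta(hours=3))
before_midnight = datetime(2026, 5, 20, 23, 30, tzinfo=user_tz)  # 20:30 UTC on May 20
after_midnight = datetime(2026, 5, 21, 1, 30, tzinfo=user_tz)    # 22:30 UTC on May 20

# With UTC fixtures, both versions give the same answer.
fixture_agrees = same_day_utc(utc_morning, utc_evening) == same_day_for_user(
    utc_morning, utc_evening, timezone.utc
)

# With the user's timestamps, they disagree.
utc_answer = same_day_utc(before_midnight, after_midnight)
user_answer = same_day_for_user(before_midnight, after_midnight, user_tz)

print(fixture_agrees, utc_answer, user_answer)
```

On UTC fixtures the two versions are indistinguishable, so a test suite built only from UTC data cannot tell them apart. For the UTC+3 user, the UTC version says the two events fall on the same day, while the user sees May 20 and May 21.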

Why this bug is new

Three things colliding produce the silent bug:

• The model now writes so well that reading the code gives you "I'd have written it this way too." Doubt becomes hard.

• AI test generation also went mainstream — AI writes the code and the test. Both come out of the same mental model and miss the same gap.

• Vibe coding culture accepts code fast. "Looks like it works, moved on" happens much more than last year.

Where the compiler and the tests both look away, the silent bug stays.

What I'm doing (still not perfect)

Things I've tried in the last few weeks to reduce it:

• I write the tests, not the AI. AI writes the code; I think through the edge cases. Let AI ask me "what do you expect for this input?" and I supply the answer.

• Smoke test with production data. Before deploying a new function I manually run 5-10 examples from real production data. Test data tends to be ideal; real data is messy.

• Second pair of eyes from a different model. If Claude wrote it, let GPT review. Two different models are less likely to both miss the same bug than one model checking its own work.

• "Why might this be wrong?" question. Instead of "verifying" the code, try to break it. The mental jiu-jitsu isn't easy, but it catches silent bugs better than anything.

These practices line up with the boundary I drew while delegating to GPT-5.5: verifiable tasks to AI, judgment to me. But "verification" needs redefining in 2026: "tests passed, compiled, seems to run" isn't enough anymore.
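As one way to picture "I write the tests", here is a small table of hand-picked edge cases for a date helper. It is a sketch under assumptions: `local_date`, `EDGE_CASES`, and `run_edge_cases` are hypothetical names, and the expected dates were worked out by hand rather than generated from the code under test.

```python
from datetime import date, datetime, timedelta, timezone


def local_date(instant: datetime, user_tz: timezone) -> date:
    """Code under test: the calendar date of an instant in the user's offset."""
    if instant.tzinfo is None:
        raise ValueError("naive datetime: the time zone is unknown")
    return instant.astimezone(user_tz).date()


# Each row: (instant in UTC, user's UTC offset in hours, date I expect by hand).
EDGE_CASES = [
    (datetime(2026, 5, 20, 23, 59, tzinfo=timezone.utc), 0, date(2026, 5, 20)),
    (datetime(2026, 5, 20, 23, 59, tzinfo=timezone.utc), 1, date(2026, 5, 21)),
    (datetime(2026, 5, 20, 0, 0, tzinfo=timezone.utc), -1, date(2026, 5, 19)),
    (datetime(2026, 5, 20, 10, 0, tzinfo=timezone.utc), 14, date(2026, 5, 21)),
    (datetime(2026, 5, 20, 11, 59, tzinfo=timezone.utc), -12, date(2026, 5, 19)),
    (datetime(2026, 12, 31, 23, 0, tzinfo=timezone.utc), 2, date(2027, 1, 1)),
]


def run_edge_cases() -> list[str]:
    """Return a description of every case where the code disagrees with me."""
    failures = []
    for instant, offset_hours, expected in EDGE_CASES:
        user_tz = timezone(timedelta(hours=offset_hours))
        got = local_date(instant, user_tz)
        if got != expected:
            failures.append(
                f"{instant.isoformat()} at UTC{offset_hours:+d}: "
                f"expected {expected}, got {got}"
            )
    return failures


failures = run_edge_cases()
print(failures if failures else "all edge cases passed")
```

The point of the table is that the expectations come from a different head than the code: midnight boundaries, the extreme offsets, and a year rollover are exactly the inputs a UTC-only fixture never exercises.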

The actual trade-off

There's a way to avoid silent bugs entirely: read every line AI writes the way you used to. But that throws away the speed AI gives you.

My balance: in critical paths (payment, security, user data, computation) the cost of a silent bug is very high, so manual reading plus adversarial testing is mandatory there. For visual styling, content, and small helper functions, speed wins: a bug there is acceptable because I can fix it fast.

The agent vs assistant distinction and the "how much control on which task" question I covered are exactly the calls behind this trade-off. How much you hand to AI should track what a silent bug would cost in that task: the higher the cost, the less you delegate.

The next stretch

Will the silent bug get solved? Partially. My guess:

• Models will learn to generate adversarial tests — not just the happy path, but ones that stress failure scenarios. Fast progress likely here.

• Tools that combine static analysis with AI will spread. Some silent bugs are things the type system can catch.

• But "semantic correctness" — is the code doing the thing it's supposed to — stays a human job for a while.

So the problem shrinks; it doesn't vanish. The habit a solo developer should build in 2026-27 is the reflex of "if AI wrote it, read it once more with suspicion."
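To make the type-system point concrete, here is a minimal sketch of pushing part of the date bug into types. The names `UtcInstant`, `LocalDay`, and the helper functions are hypothetical. It assumes a static checker such as mypy or pyright is run on the code; at runtime `NewType` does nothing, so only the explicit `ValueError` check is enforced when the program executes.

```python
from datetime import date, datetime, timedelta, timezone
from typing import NewType

# Distinct static types, so "an instant" and "a day as the user sees it"
# cannot be mixed up without a type checker complaining.
UtcInstant = NewType("UtcInstant", datetime)
LocalDay = NewType("LocalDay", date)


def to_utc_instant(dt: datetime) -> UtcInstant:
    """Only timezone-aware datetimes may become instants."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        raise ValueError("naive datetime: the time zone is unknown")
    return UtcInstant(dt.astimezone(timezone.utc))


def to_local_day(instant: UtcInstant, user_tz: timezone) -> LocalDay:
    """The only way to get a LocalDay is to say whose time zone it is."""
    return LocalDay(instant.astimezone(user_tz).date())


def same_local_day(a: LocalDay, b: LocalDay) -> bool:
    """Passing a raw datetime or date here is an error for a static checker."""
    return a == b


user_tz = timezone(timedelta(hours=3))  # assumed fixed offset, for illustration
first = to_local_day(
    to_utc_instant(datetime(2026, 5, 20, 20, 30, tzinfo=timezone.utc)), user_tz
)
second = to_local_day(
    to_utc_instant(datetime(2026, 5, 20, 22, 30, tzinfo=timezone.utc)), user_tz
)
print(first, second, same_local_day(first, second))
```

This does not decide whether comparing local days is the right behavior; that stays a judgment call. It only makes the "which time zone?" question impossible to skip silently.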

Singrey's note

The two days it took me to notice the bug in Cubitz taught me a lot, first of all that "the code looks right" is now a new kind of trap. AI's old bugs were noisy and easy to spot; today's are quiet and polite. Silent bugs cost more in the long run than noisy ones. When I remember that, my hand pauses before it hits the keyboard, and that tiny pause is the best defense I've got right now.