Small models are eating the big-model market

May 17, 2026 | 5 min read | Singrey

By mid-2026, 7-8B-parameter models handle yesterday's GPT-4-class work on a laptop or phone. Here's what that quietly changed for a solo developer.

One of the quietest shifts this year: 7-8B-parameter models now do most of the work a GPT-4-class model did two years ago, on a laptop and sometimes on a phone. The curve didn't make headlines, so it was easy to miss. For a solo developer it may be the most important news of the year.

The gap between frontier and "good enough" is shrinking

Two years ago the chasm between a frontier model and a small one was huge. The chasm still exists today, but small models now clear the "good enough" bar for so many tasks that frontier models are a luxury for most daily work.

Calling Claude Opus or GPT-5.5 for a blog draft, a simple data transform, a SQL explanation, or a form validation is a sledgehammer for a walnut. Smaller alternatives existed before, but their latency and quality pushed people to the big model anyway. Now choosing the small model makes sense.

Open-weights models pulled the curve forward

When I wrote about the open-weights multimodal Kimi K2.6 and about GLM-5.1 launching on Huawei Ascend without Nvidia, my hunch was that these models don't make headlines, but they quietly change infrastructure decisions.

In mid-2026 that effect is measurable. A startup's "what do I default to" question is no longer answered by a single closed API. For everyday tasks, running an open-weights model on your own server can deliver roughly 95% of the frontier result at about one-tenth the cost.

What changed for a solo developer

In practice, this is how I work with that curve: when I build a tool chain, I no longer drop a single "best" model into it. I use the expensive model only at the point that actually requires judgment, and a cheap, fast model for everything else.

Result: the same product runs at a tenth of the old cost, with lower latency. And when the small model runs locally there's another bonus: no internet dependency, and user data never leaves the device.
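To make the "tenth of the old cost" arithmetic concrete, here is a minimal sketch of a blended-cost calculation. The tier names, per-call costs, and traffic shares are all made-up illustrative numbers, not measurements from any real product or price list. They only show how sending a small share of calls to the expensive model pulls the average down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_call: float  # hypothetical cost, arbitrary units
    share: float          # fraction of all requests routed to this tier


def blended_cost(tiers: list[Tier]) -> float:
    """Average cost per request across a mix of model tiers."""
    total_share = sum(t.share for t in tiers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(t.cost_per_call * t.share for t in tiers)


# Illustrative assumption: the small model costs 5% of the frontier model
# per call, and only 5% of requests truly need the frontier model.
all_frontier = [Tier("frontier", 1.0, 1.0)]
mixed = [Tier("frontier", 1.0, 0.05), Tier("small", 0.05, 0.95)]

ratio = blended_cost(mixed) / blended_cost(all_frontier)
print(f"mixed stack costs {ratio:.1%} of the all-frontier stack")
```

With these assumed inputs the mixed stack lands just under a tenth of the all-frontier cost. The real ratio depends entirely on your own prices and on how much traffic genuinely needs the big model.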

"Multi-model" is the 2026 default

I'm writing this because of one realization: in 2024-25 "one model for everything" was normal. In 2026 "the right model for each job" became normal. When I wrote about GPT-5.5 becoming the new ChatGPT default, the trend I pointed at was actually a sign of something larger: in front of the same user, models of different sizes rotate based on the question.

This architecture isn't an abstract debate. I do exactly this in Cubitz: a tiny model classifies the user's 200-character input, a mid-size model steps in to decide when needed, and only at the production step does a frontier model get called. Each layer runs the right size for the right job.
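A three-tier cascade like this can be sketched in a few lines. This is not the Cubitz code, just one reading of the flow described above. The label names ("unsure", "produce"), the `route` function, and the three stub functions standing in for model calls are all hypothetical.

```python
from typing import Callable

MAX_INPUT_CHARS = 200  # input length mentioned in the text


def route(
    text: str,
    classify: Callable[[str], str],  # tiny model: returns a label
    decide: Callable[[str], str],    # mid-size model: resolves "unsure"
    produce: Callable[[str], str],   # frontier model: final generation
) -> tuple[str, list[str]]:
    """Run the cascade and return (result, list of tiers that were called)."""
    snippet = text[:MAX_INPUT_CHARS]
    trace = ["tiny"]
    label = classify(snippet)
    if label == "unsure":            # escalate only when the tiny model can't tell
        trace.append("mid")
        label = decide(snippet)
    if label == "produce":           # frontier model only at the production step
        trace.append("frontier")
        return produce(snippet), trace
    return f"[{label}] handled without frontier", trace


# Stub "models" so the sketch runs offline; real ones would call an
# on-device model or an API here.
def fake_classify(text: str) -> str:
    t = text.lower()
    if t.startswith("generate"):
        return "produce"
    return "unsure" if "?" in t else "smalltalk"


def fake_decide(text: str) -> str:
    return "produce" if "build" in text.lower() else "faq"


def fake_produce(text: str) -> str:
    return f"frontier output for: {text}"


for msg in ["hello there", "what are your hours?", "can you build a level?"]:
    result, trace = route(msg, fake_classify, fake_decide, fake_produce)
    print(trace, "->", result)
```

The returned trace makes it easy to log how often each tier fires, which is the number that tells you whether the cascade is actually saving money.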

Where the trend is heading

A few quick predictions:

• 4-8B models running by default on phone SoCs become standard by 2027. The Apple Intelligence and Pixel ecosystems are already accelerating this.

• An open-weights small model paired with tool use becomes a serious alternative to closed frontier models.

• The question "which model do you use" gives way to "how did you architect your model stack."

Singrey's note

For years I assumed going to the "most powerful model" was the right move. In 2026 I noticed something different: the most powerful model usually gives the right answer, but the right answer isn't always what you need. A "good enough answer" delivered at the right speed and the right price is more valuable for a product than anything else. The quiet revolution of small models is exactly what's opening that door.