Choosing an LLM for a production job: a blind, cross-vendor evaluation

The newest, most-hyped model won the one metric I started measuring, then ranked dead last on everything that actually mattered. That sentence is the whole reason I'm writing this up.

I run an automated pipeline that turns a one-line idea ("two cartoon fish have a jealous argument") into a finished vertical video. At its heart is a single LLM call I call the scriptwriter: from the plot and the cast it emits a structured shot-list for every scene, with the first-frame description, the camera motion, the spoken dialogue, the emotional beats, and a list of objects that must stay hidden until a deliberate reveal.

That one call has two properties that make it worth obsessing over:

It's the quality ceiling. Every later stage (image, video, voice) only renders what the scriptwriter decided. Flat dialogue or a vague description can't be rescued downstream.
It runs on every project, so its per-call price is multiplied by the entire workload.

Put those together and the question stops being "which model is best" and becomes "which model gives the most quality per dollar", and whether the cheap incumbent is leaving quality on the table that's worth paying for.

What triggered it

Two complaints about the incumbent, a cheap GPT-5.4-mini. First, a first-frame leak. For a comedic pull-back reveal (two fish argue, then the camera zooms out to show they're in a tiny aquarium, in a house, underwater, with a hawk flying past) the model wrote the reveal straight into the opening frame. Punchline spoiled, first shot cluttered. Second, thin dialogue: functional but flat, often only one character spoke.

First-frame leak: the entire reveal crammed into the opening frame

The bug, visualized: GPT-5.4-mini wrote the whole pull-back reveal into the opening frame — house, room, hawk and the tiny aquarium all at once. The punchline is spoiled before the scene starts.

I could have swapped models on vibes. Instead I ran an eval, because the cost of this call is multiplied across every video I'll ever make.

The method (this is the reusable part)

Fair input. I didn't write a synthetic test prompt. I reconstructed the exact messages the pipeline sends in production: a ~12,000-token system prompt carrying the plot, the characters, and all the authoring rules, plus a tiny user prompt ("generate a shot-list of 1 scene") and a strict JSON output contract. Every model got byte-identical content; only the structured-output mechanism was adapted per vendor.

Four models along the cost axis: the cheap incumbent (GPT-5.4-mini), the newest full GPT (GPT-5.5), a top-tier Claude (Opus 4.8), and a mid-tier Claude (Sonnet 4.6).

Two instruments. A deterministic leak scanner that mechanically inspects the t=0 fields for concealed words, undeclared "intruder" nouns, and reveal phrasing (score: 10 minus penalties). And a blind, cross-vendor judge panel: two judges from different vendors (one Opus, one GPT-5.5) scored the four outputs anonymized, relabeled A/B/C/D, with no judge told which model wrote which or which was its own, across six dimensions: dialogue, description, motion, emotion, first-frame discipline, and coherence.

Two judges from different vendors guard against any single model's taste; anonymizing the inputs stops a judge from flattering its own work.

What happened

Round 1, the narrow leak test. On a fresh sample, all four models scored a clean 10/10, against the real production frame's 0/10. The same incumbent that leaked in production produced a clean frame here, from the same prompt.

That's the first lesson, and it's easy to miss: the bug was stochastic, not a model defect. At a normal sampling temperature the failure is probabilistic. A single run can't separate models on this axis, and no model swap guarantees a clean frame.

Round 2, the full blind evaluation. Because the narrow test saturated, I judged the whole output. Panel totals out of 60:

Claude Opus 4.8 — 49.5
Claude Sonnet 4.6 — 49.0
GPT-5.4-mini (incumbent) — 40.5
GPT-5.5 — 34.0

Both judges independently put the two Claude models on top and GPT-5.5 last. The detail that made me trust it: the GPT-5.5 judge ranked its own vendor's models last. No home-team bias; if anything the opposite.

What separated them was concrete. The top two wrote a real two-way exchange. One fish accuses, the other answers ("I was just eating…", tying the excuse to the worm in its mouth), plus animatable physical comedy ("snaps a fin across the face, the worm flies out"). Both GPT models gave a one-sided accusation and left the second character idle.

And the punchline of the whole study: GPT-5.5, the newest flagship, won the narrow leak metric but came last overall. It kept the first frame clean, then moved the reveal into a different field the scanner didn't check. It passed the proxy and spoiled the gag anyway.

Clean first frame on Sonnet 4.6: only the two characters

The clean first frame on Sonnet 4.6: just the two characters. The worm in the clownfish's mouth is the excuse the dialogue pays off — a story detail, not a leaked reveal.

Optimizing a single metric can actively select the worse product. The only thing that caught it was looking at the whole output, blind.

The quality-vs-cost decision

This is the section the study exists for. Measured on a live call, one scene on Sonnet 4.6 used ~12k input + ~1.3k output tokens; input is dominated by the fixed system prompt, output scales with scene count. As rough per-project ratios for a 6-scene video (re-check current vendor prices before quoting; the ratios are the durable part):

GPT-5.4-mini — ~1× (cheapest), quality 40.5
Sonnet 4.6 — ~15× the incumbent, quality 49.0
Opus 4.8 — ~5× Sonnet, quality 49.5
GPT-5.5 — same tier as Sonnet, quality 34.0

Read it three ways:

Cheapest isn't free. The incumbent is ~15× cheaper, but 8.5 quality points lower, and since the scriptwriter is the ceiling, that gap shows up in every video. The saving is paid back as worse output.
Most expensive isn't worth it here. Opus is nominally #1, but its 0.5-point edge over Sonnet is inside the noise of a two-judge panel, at ~5× the per-token cost, on a call that runs on every project.
The newest flagship is dominated. GPT-5.5 is both worse and not cheaper. There's no operating point where it's the right pick. "Newest and most-hyped" is not the same as "best for the job."

So I shipped Sonnet 4.6, the knee of the curve: roughly Opus quality, well above both GPTs, at about a fifth of Opus's price. Two refinements keep the bill honest. The non-creative helper calls (a validator, a dialogue normalizer) stay on the cheap model since they don't author anything, and the scriptwriter model is a single config constant, so promoting to Opus or demoting back is a one-line change.

And the bug that started it all? A model upgrade lowers the odds of a first-frame leak but can't eliminate a dice roll. The robust fix is deterministic: assemble the opening frame from constrained slots (characters, their pose and emotion, ambient setting only) and strip any reveal or intruder object in code, independent of the model. Paying for a bigger model to buy reliability you can get deterministically for free is just the cheap-model trap inverted.

What I'd want you to take, not take on faith

The honest limits: this ranking rests on one generation per model and two judges. The Opus/Sonnet gap is noise, so treat them as tied. The judges are also players (anonymization and cross-vendor agreement reduce self-preference but don't erase it). It's one scenario, one genre. A stronger study runs N≥10–20 samples per model and reports distributions, not point scores. None of that overturns the directional result, but it bounds how hard you can state it.

The parts that travel, regardless of which vendor wins in your case:

Test on your real prompt, not a synthetic one. Fairness depends on identical, production-true input.
Measure holistically and blind. A single proxy metric will eventually pick the worse product; anonymized, cross-vendor judging catches it.
Price with real tokens and compare quality-per-dollar, not peak quality, especially for a call multiplied across your whole workload.
Separate stochastic bugs from model quality. If a defect comes and goes across runs of the same model, it's probabilistic: fix it in code, don't buy your way out with a bigger model.

---

I'm Dzmitry Harupa, a fractional CTO who builds and ships AI systems hands-on and writes up what the work teaches me, like how I gave my AI agents persistent memory between sessions. If you run evals like this, or want to challenge how I ran this one, tell me. The sharpest objections are the most useful.