Grok Imagine 1.5: testing the new video model — experience and comparison
Grok Imagine 1.5 is out — xAI's fresh video model with native audio and lip-sync. I ran it through my pipeline right away on a scene I deliberately keep difficult, and lined it up against the models I usually pick between: Seedance 2.0, Kling v3 and Veo 3.1. What follows is my experience with Grok and an honest comparison.
Context from the first part: there I was choosing a scriptwriter, an LLM call that turns an idea into a shot list. The scriptwriter only describes the frame in words; the next layer, the video model, draws it. That's what I'm testing here.
The scene and how I ran it
The scene is silly and deliberately hard. Two cartoon fish in Pixar style: a pink girl and a blue boy. She accuses him of cheating, slaps him in the face with her fin — a worm flies out of his mouth — and bursts into floods of tears. The camera pulls back, and it turns out the two of them are in a tiny aquarium on a shelf in a dry room. The lines are in English, the audio and lip-sync are native to the model, 10 seconds, vertical.
Why this one. The scene hits several sore spots at once:
- A two-character dialogue, where the second one has to answer instead of standing there like a post.
- A physical action that's easy to read wrong: a slap versus a kiss.
- An emotion you should hear, not just see.
- A late twist — the camera pull-back that flips the whole scene.
If a model can pull this off, it can pull off almost anything. The input is the same for all of them: same prompt, same starting frame, judged by eye — mine.
All four clips are in English, and that's on purpose. English is the fair common baseline every model handles, so the test measures motion, camera and audio rather than who learned the language best. Russian is a separate matter, and I'll come back to it below.
And to be honest up front: this is one comedic scene, and generation is stochastic — the same prompt gives a slightly different take every time. So it's a reference point, not a measurement. More "on this kind of material I'd go with that one" than exact percentages.
Grok Imagine 1.5 — the one I'm testing
Grok Imagine 1.5: 480p at $0.08/s, native audio and lip-sync.
The cheapest of the four: 480p at eight cents a second. It played the beat the way it should: the slap, the second fish's reply, the crying. The twist works, but on one condition: you have to spell out the dry room explicitly, otherwise Grok surrounds the aquarium with… more water. With anchors (shelf, window, toys) the pull-back lands in a real room.
Native audio with lip-sync out of the box, and the price is laughably low. For me it's the golden mean: enough quality for most tasks, at a price I don't mind paying per project. I reckon you can migrate to Grok from Kling: it covers the same ground, cheaper and with its own pluses.
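To put a number on that price, here is a minimal sketch of the per-clip arithmetic. The $0.08/s rate and the 10-second clip length come from this test; the function name `clip_cost` and the five-take budget are illustrative assumptions of mine, not part of any xAI API.

```python
from decimal import Decimal

# Price quoted in this article for Grok Imagine 1.5 at 480p.
GROK_PRICE_PER_SECOND = Decimal("0.08")


def clip_cost(seconds: int, price_per_second: Decimal, takes: int = 1) -> Decimal:
    """Total cost of generating `takes` versions of a clip of the given length."""
    if seconds <= 0 or takes <= 0:
        raise ValueError("seconds and takes must be positive")
    return price_per_second * seconds * takes


one_take = clip_cost(10, GROK_PRICE_PER_SECOND)
five_takes = clip_cost(10, GROK_PRICE_PER_SECOND, takes=5)  # assumed re-roll budget

print(f"one 10 s take: ${one_take}")    # one 10 s take: $0.80
print(f"five re-rolls: ${five_takes}")  # five re-rolls: $4.00
```

Even with a handful of re-rolls, a 10-second scene stays within a few dollars at this rate.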
And separately, on Russian, jumping ahead: of all four, Grok voices it best — clean, no foreign accent. For Russian clips it's my pick.
Seedance 2.0 — best motion
Seedance 2.0: best motion on this clip, strong English audio.
The best motion of all four on this clip: the slap, the flying worm, the physics of comedy. Its detail and emotions are stronger than Kling's too, and the English audio is good. But there are downsides. First, Seedance sometimes takes the prompt too literally, and then the scene looks unnatural; Kling is smoother at that. Second, its Russian isn't great (more on that below). It also can't take multiple characters as separate references, because it works from the starting frame. The ceiling is 15 seconds.
On the picture it beat Grok here. If the scene is in English and motion matters in it — this is the first candidate.
Kling v3 — camera and multi-character
Kling v3: best camera, multiple characters, up to 15 seconds — but the tears came out so-so.
The best camera and atmosphere of the four. It can handle multiple characters via elements, accepts its own speech format [character, tone], and stretches to 15 seconds. Kling itself is already a touch dated, but on quality it's still very strong. The downside: the subtle effects came out less natural, and the tears didn't look great. As for Russian, it technically supports the language but pronounces it with a noticeable foreign accent, which makes it unusable for voiceover.
This is the model I go to for a cinematic camera, multiple characters, or a clip longer than ten seconds.
Veo 3.1 (fast) — not for this
Veo 3.1 (fast): motion 3/10, ceiling 8 seconds, strict filter, most expensive of all.
It fell short here. Motion is a 3/10, and the twist is ugly. An 8-second ceiling, a strict content filter, and the most expensive of the lot. For fast comedic motion and big camera pull-backs it's a miss. It might come into its own on calm, realistic shots without violence, but that's already different material.
Where Grok ended up
There's no single winner — it all depends on the content. That, in fact, is why in the pipeline the model choice sits with each project rather than being hard-wired once for everything. Where Grok Imagine 1.5 landed and where it gets beaten:
- English comedy with one or two characters and lines: Seedance or Grok, which had the best lip-sync and effects. Seedance wins on detail and emotion (but sometimes takes things too literally); Grok is the cheap and reliable default.
- Russian dialogue: Grok, hands down. Kling and Seedance both have problems with Russian (a noticeable foreign accent, unusable for voiceover), while Grok sounds clean.
- Cinematic camera, atmosphere, two or three characters, longer than ten seconds: Kling. Slightly dated, but the quality is still strong.
- Veo: not for fast comedy and not for pull-back twists. It has its own use cases, not these.
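The per-project choice in the list above can be sketched as a small routing function. This is only an illustration of the idea, not my actual pipeline code: the `Project` fields, the model id strings and the exact cut-offs (three or more characters, longer than ten seconds) are assumptions I made to turn the bullets into rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    language: str                  # e.g. "en" or "ru"
    characters: int                # speaking characters in the scene
    seconds: int                   # target clip length
    cinematic_camera: bool = False
    motion_heavy: bool = False


def pick_video_model(p: Project) -> str:
    """Map project traits to a model, following the list above (ids are hypothetical)."""
    if p.language == "ru":
        return "grok-imagine-1.5"      # cleanest Russian voice in this test
    if p.cinematic_camera or p.characters >= 3 or p.seconds > 10:
        return "kling-v3"              # camera, multi-character, longer clips
    if p.motion_heavy:
        return "seedance-2.0"          # best motion and detail on the English scene
    return "grok-imagine-1.5"          # cheap, reliable default


print(pick_video_model(Project("en", characters=2, seconds=10, motion_heavy=True)))
print(pick_video_model(Project("ru", characters=2, seconds=10)))
```

Veo never comes out of this function on purpose: for the material tested here it has no slot, and a calm realistic project would need its own rule.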
And on top of it all — the lesson from the first part: this call runs on every project, so I look at quality per dollar, not at the absolute maximum. Grok at eight cents a second as the baseline model is hard to argue with.
Why the scene worked at all
The most useful thing here isn't even the models themselves, but the prompt that makes any of them play the beat. Here's what I took away.
A slap, not a kiss. The model fills in the ambiguity with its prior: two faces side by side plus "reaches toward" — and you get a kiss. I had to spell it out directly: "slaps the cheek, head whips to the side, NOT a kiss." And subtler still — the character description has to be about appearance only. One "slight smile, calmly swimming" in the character anchor, and an angry slap renders as a tender lean-in.
The second character has to react. The blue fish's bewildered "What?! There's no one else here!" is the punchline itself, and it's also what triggers the twist. A silent, motionless second character looks dead.
The sound has to be named explicitly. "Wails out loud, big cartoon tears," not "cries"; otherwise the model's native audio may stay silent.
One twist per ten seconds. When I crammed everything into the clip at once (the accusation, the slap, the tears, the aquarium, the table, the room, the window, the house and a hawk flying past), the model mixed up the order: the hawk ended up inside the room. One late pull-back in its own time window (5–10s) works. A multi-step twist is really several separate scenes.
A twist needs dry anchors. From an underwater starting frame the model surrounds the aquarium with more water by default. You have to force it: "water ONLY inside the glass, a dry bedroom outside" plus concrete objects — shelf, toys, window. With anchors the pull-back lands in the room, without them it stays underwater.
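The prompt rules so far (appearance-only character anchors, an explicit action, a named sound, one twist with dry anchors in its own window) can be sketched as a tiny prompt assembler. The quoted phrases are the ones from this scene; the function names, the `MOOD_WORDS` list and the output layout are hypothetical, and no video model requires this exact format.

```python
# Words that describe mood rather than appearance; an illustrative list, not exhaustive.
MOOD_WORDS = ("smile", "calm", "tender", "happy")


def mood_leaks(character_anchor: str) -> list[str]:
    """Return mood words found in a description that should be appearance-only."""
    lowered = character_anchor.lower()
    return [word for word in MOOD_WORDS if word in lowered]


def build_prompt(characters: list[str], beats: list[str], twist: str,
                 dry_anchors: list[str], twist_window: str = "5-10s") -> str:
    """Assemble the scene prompt: appearance, explicit beats, then one anchored twist."""
    for anchor in characters:
        leaks = mood_leaks(anchor)
        if leaks:
            raise ValueError(f"mood words in character anchor {anchor!r}: {leaks}")
    lines = ["Characters: " + "; ".join(characters)]
    lines += [f"Beat {i}: {beat}" for i, beat in enumerate(beats, start=1)]
    lines.append(f"Twist ({twist_window}): {twist} Visible outside the glass: "
                 + ", ".join(dry_anchors) + ".")
    return "\n".join(lines)


prompt = build_prompt(
    characters=["pink cartoon fish, girl", "blue cartoon fish, boy"],
    beats=[
        "she slaps the cheek, head whips to the side, NOT a kiss",
        'he answers, bewildered: "What?! There\'s no one else here!"',
        "she wails out loud, big cartoon tears",
    ],
    twist="camera pulls back; water ONLY inside the glass, a dry bedroom outside.",
    dry_anchors=["shelf", "toys", "window"],
)
print(prompt)
```

The check on character anchors is deliberately crude: it only catches the kind of "slight smile, calmly swimming" leak that turned the slap into a lean-in.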
It's stochastic, re-roll it. On the very same prompt the slap came out sometimes crisp, sometimes limp, sometimes not at all. The prompt sets a distribution, not a result. Don't chase determinism with wording — it's easier to re-roll the take. And when you need a guarantee (like with the leaky first frame from the first part) — that comes from deterministic code, not a bigger model.
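Re-rolling is easy to wrap in a loop with a fixed budget. The sketch below is a minimal version under assumptions: `generate` and `accept` are hypothetical callables (a real one would call the video model and run whatever check you trust, including your own eyes), and the fixed outcome list only stands in for the randomness of real takes.

```python
from typing import Callable, Optional, Tuple


def reroll(generate: Callable[[int], str],
           accept: Callable[[str], bool],
           max_takes: int = 5) -> Tuple[Optional[str], int]:
    """Generate takes until one is accepted or the budget runs out.

    Returns (accepted take or None, number of takes spent).
    """
    for attempt in range(1, max_takes + 1):
        take = generate(attempt)
        if accept(take):
            return take, attempt
    return None, max_takes


# Stand-in for a real video model call: a fixed sequence of outcomes for the same prompt.
FAKE_OUTCOMES = ["limp slap", "no slap", "crisp slap", "crisp slap", "limp slap"]


def fake_generate(attempt: int) -> str:
    return FAKE_OUTCOMES[(attempt - 1) % len(FAKE_OUTCOMES)]


take, spent = reroll(fake_generate, accept=lambda t: t == "crisp slap")
print(take, spent)  # crisp slap 3
```

The prompt stays untouched between attempts; only the take changes, which is the whole point of treating the prompt as a distribution.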
In short
If I boil it right down — here's my breakdown by aspect (scores 1–5, subjective and on this material):
| Aspect | Grok | Seedance | Kling | Veo |
|---|---|---|---|---|
| Motion | 4 | 5 | 4 | 2 |
| Detail and emotions | 3 | 5 | 4 | 2 |
| Camera | 3 | 4 | 5 | 2 |
| Naturalness | 4 | 3 | 5 | 2 |
| Audio and lip-sync | 5 | 4 | 3 | 2 |
| Multilingual | 5 | 2 | 2 | — |
| Multi-character | — | — | ✓ | — |
| Max length | 15s | 15s | 15s | 8s |
| Price | cheap | mid | mid | expensive |
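
One way to read the table is to weight the aspects by what a given project needs. The sketch below copies the five numeric rows from the table; the weights and the `rank` function are illustrative assumptions, and price is left out because the table only gives it qualitatively.

```python
# Scores copied from the table above (1-5, subjective, this scene only).
SCORES = {
    "Grok":     {"motion": 4, "detail": 3, "camera": 3, "naturalness": 4, "audio": 5},
    "Seedance": {"motion": 5, "detail": 5, "camera": 4, "naturalness": 3, "audio": 4},
    "Kling":    {"motion": 4, "detail": 4, "camera": 5, "naturalness": 5, "audio": 3},
    "Veo":      {"motion": 2, "detail": 2, "camera": 2, "naturalness": 2, "audio": 2},
}


def rank(weights: dict[str, int]) -> list[tuple[str, int]]:
    """Weighted sum per model, highest first; aspects missing from weights count as 0."""
    totals = {
        model: sum(weights.get(aspect, 0) * score for aspect, score in aspects.items())
        for model, aspects in SCORES.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weightings: a dialogue-driven clip versus a camera-driven one.
dialogue = {"motion": 1, "detail": 1, "camera": 1, "naturalness": 1, "audio": 4}
camera = {"motion": 1, "detail": 1, "camera": 4, "naturalness": 1, "audio": 1}

print(rank(dialogue))  # [('Grok', 34), ('Seedance', 33), ('Kling', 30), ('Veo', 16)]
print(rank(camera))    # [('Kling', 36), ('Seedance', 33), ('Grok', 28), ('Veo', 16)]
```

The ranking flips with the weights, which is the same conclusion in numbers: the right model depends on the content.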
So, on Grok Imagine 1.5: for me it's the golden mean — native audio, lip-sync, eight cents a second — and a worthy replacement for Kling that I'm leaning toward. On the English scene Seedance beat it on motion and gives better detail and emotions, but it sometimes takes things too literally, and Grok's price tips the balance. On Russian Grok is simply the best: Kling and Seedance both have problems with it. Beyond that it's down to content: English comedy — Seedance; camera and multiple characters — Kling (slightly dated, but still strong); Veo I've set aside for now. There's no single winner, which is why the model choice in the pipeline is tied to the project.