#ai #image-generation #flux #qwen #novel #research #multimodal

The Butterfly Effect v2: one novel, four ways to read it

Tonight the novel became a platform: Plain English edition for ESL readers, Scroll mode for phones, a film-grammar bible driving cinematic image directions, glossary tooltips with Hindi defs, and a Gemini-based QA pass that grades each image. One source of truth, four ways to read it.

May 17, 2026 AI

Chapter 1 hero, image v2 - generated locally on a Vast.ai B200 by FLUX.1 Kontext following directorial intent from a film-grammar bible written specifically for this novel.

A week ago I shipped The Butterfly Effect with locked-identity FLUX illustrations and a fresh blog post about how we got there: Illustrating a 40-chapter AI novel without losing the characters’ faces.

Tonight the project became something bigger. Not just a novel with illustrations. A platform that renders one source of truth into many different editions for many different readers, with the iteration loop running on hardware I control. This post is what changed and why.

The reframe

The trigger was a sentence from Shivam after he saw v1:

some people like to read, they would read. some people just like to scroll through images, they can scroll through images. and maybe eventually we just combine all of this. my mom couldn’t read this because the words were very difficult. she is an Indian lady, 55 years old.

That sentence took the project from “a novel with illustrations” to “a single source of truth that projects into multiple consumption modes for multiple readers.” The architecture is now:

ONE SOURCE OF TRUTH
- chapter prose (English original)
- visual bible (characters, locations, props, palette)
- film-grammar bible
- image plan (anchors + scene descriptions)

MANY RENDERED EDITIONS
- Original     literary English, prose-dominant
- Easy         Plain English, simpler vocab, 12-15 word sentences
- Scroll       image-dominant, mobile-first, prose collapsed to captions
- Hindi        (next session, needs DeepSeek)
- Audiobook    (next session, needs TTS)
- Video        (when Veo/Kling/Runway gets there)

All editions read from the same bible + plan + prose. They are projections, not duplicates. When the source changes, every edition re-derives.
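A minimal sketch of what "projections, not duplicates" can look like in code. The `Chapter` and `Image` types and both render functions are hypothetical names for illustration, not the repo's actual API; the point is only that each edition is a pure function of the same source object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    anchor: str   # phrase in the prose that the image attaches to
    src: str      # hypothetical file name
    caption: str


@dataclass(frozen=True)
class Chapter:
    number: int
    summary: str
    paragraphs: tuple[str, ...]
    images: tuple[Image, ...] = ()


def render_original(ch: Chapter) -> list[str]:
    """Prose-dominant: every paragraph, each image placed after its anchor paragraph."""
    out = []
    for para in ch.paragraphs:
        out.append(para)
        out.extend(f"[image: {img.src}]" for img in ch.images if img.anchor in para)
    return out


def render_scroll(ch: Chapter) -> list[str]:
    """Image-dominant: chapter summary, then figure + caption for every image."""
    out = [ch.summary]
    out.extend(f"[image: {img.src}] {img.caption}" for img in ch.images)
    return out
```

Because neither renderer stores its own copy of the text, editing a `Chapter` and re-running the renderers is all "re-derive" needs to mean.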

What shipped tonight

Plain English edition

Forty chapters adapted via Qwen2.5-72B-Instruct-AWQ running on a B200 rented from Vast.ai for $4/hr. The model gets a simplification bible that tells it exactly what to preserve (character names, the scar on Kael’s left forearm, anchor phrases that drive image placement) and what to soften (literary metaphors, long compound sentences, Latinate vocabulary). The reader persona it’s writing for is Shivam’s mother.

Live at /easy/<chapter> next to every original chapter. A switch in the top-right of every chapter page lets you jump between editions.

36 of 40 chapters cleanly validated on first generation. 4 flagged for minor word-count drift; story-complete in all cases, just slightly tighter than the original.
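A sketch of what such a validation check could look like. The post does not state the actual rule, so the 15% drift threshold, the function names, and the exact-substring check for preserved phrases are all assumptions.

```python
def word_count_drift(original: str, adapted: str) -> float:
    """Relative change in word count; negative means the adaptation is shorter."""
    n_orig = len(original.split())
    if n_orig == 0:
        raise ValueError("original text is empty")
    return (len(adapted.split()) - n_orig) / n_orig


def validate_adaptation(original, adapted, must_keep, max_drift=0.15):
    """Return a list of human-readable problems; an empty list means clean.

    max_drift=0.15 is an assumed threshold, not the one used in the project.
    """
    problems = []
    drift = word_count_drift(original, adapted)
    if abs(drift) > max_drift:
        problems.append(f"word-count drift {drift:+.0%}")
    # Phrases the simplification bible says to preserve (names, anchor phrases).
    for phrase in must_keep:
        if phrase in original and phrase not in adapted:
            problems.append(f"missing preserved phrase: {phrase!r}")
    return problems
```

Checking anchor phrases matters more than the word count: if an anchor disappears from the Easy text, the image that hangs off it has nowhere to land.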

Scroll mode

A third projection of the same source. Image-dominant. Hero image at top, then for every inline image: figure + caption. Prose collapsed to the chapter summary. Mobile-first layout designed for phone-thumb scrolling.

Same images, same anchors, same captions. Just a different renderer. You’re seeing the visual story instead of the textual one.

Live at /scroll/<chapter>.

Image v2: a director layer

The v1 images had locked character identity but read as “competent and dead” - composed faces, generic mid-shots, no opinion. The cause was structural: FLUX was being told what was in the frame (Sūrya at a control station) but not how to shoot it (low-angle, harsh edge light, the moment before she decides).

The fix was a missing pipeline layer: a director. Concretely, an original film-grammar bible written from scratch for this novel - not borrowed from any reference director - and a director tool that runs every plan entry through Qwen-72B with that bible to rewrite the scene field as a cinematic shot direction with explicit angle, light, intent, and emotion.
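Roughly, the director step has this shape. This is a sketch under assumptions: the prompt wording, the plan-entry keys (`chapter`, `scene`, `scene_v1`), and the `llm` callable are placeholders, and the real tool's prompt and its Qwen client are not shown in this post.

```python
def build_director_prompt(grammar_bible: str, entry: dict) -> str:
    """Assemble the instruction sent to the model for one plan entry."""
    return (
        "You are the director for this novel. Follow the film-grammar bible.\n\n"
        f"FILM-GRAMMAR BIBLE:\n{grammar_bible}\n\n"
        f"PLAN ENTRY (chapter {entry['chapter']}):\n{entry['scene']}\n\n"
        "Rewrite the scene as one cinematic shot direction. "
        "State the camera angle, the light, the intent, and the emotion."
    )


def direct_entry(entry: dict, grammar_bible: str, llm) -> dict:
    """Return a copy of the plan entry with its scene rewritten by the model.

    `llm` is any callable that takes a prompt string and returns text,
    for example a thin wrapper around a locally served model.
    """
    directed = dict(entry)
    directed["scene_v1"] = entry["scene"]  # keep the old wording for v1/v2 comparison
    directed["scene"] = llm(build_director_prompt(grammar_bible, entry)).strip()
    return directed
```

Passing the model in as a plain callable keeps the backend swappable and makes the step testable with a stub.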

Compare:

Chapter 1 hero, v1: "Kael at the workbench, lens to the light." Composed. Centered. No opinion.

Chapter 1 hero, v2: "Low-angle three-quarter, Kael leaning over the bench, the scarred forearm catching amber louver-light, eyes closed in concentration." The frame has intent.

Chapter 8 hero, v1: Sūrya in profile at a glowing display. Generic.

Chapter 8 hero, v2: Sūrya turned three-quarter, fingers on the warm panel, eye contact toward camera. The scene has stakes.

All 367 frames regenerated. A v1 / v2 pill in the chapter header lets you flip between the two for any chapter. The choice persists across pages via a .too.foo-scoped cookie.

Per-emotion character references

Future image v3 will go further. Each main character now has 4-5 emotion variants of their canonical reference: Sūrya-grieving, Sūrya-wonder, Sūrya-decisive, Kael-decisive, Kael-curious, Kael-weary, Moss-translating-strain, Moss-witnessing, Old-Sekani-weeping-validated. The next generation pass picks the right reference per scene’s emotional beat instead of always defaulting to “composed.”

Kael decisive (back to camera, walking forward)

Old Sekani, weeping with validation (the sky-voices he believed in for sixty years, finally heard)
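Picking a reference per emotional beat can be as small as a keyed lookup with a fallback. A sketch, assuming references are indexed by (character, emotion); the file paths and the `pick_reference` name are hypothetical.

```python
def pick_reference(character: str, emotion: str, refs: dict, fallback: str = "composed") -> str:
    """Return the reference for (character, emotion), else the fallback variant."""
    for key in ((character, emotion), (character, fallback)):
        if key in refs:
            return refs[key]
    raise KeyError(f"no reference image for {character!r}")


# Hypothetical paths; the emotion names come from the variants listed above.
REFS = {
    ("Kael", "composed"): "refs/kael-composed.png",
    ("Kael", "decisive"): "refs/kael-decisive.png",
    ("Kael", "weary"): "refs/kael-weary.png",
}
```

Falling back to "composed" keeps the old behaviour for any beat that has no dedicated variant yet, so a missing emotion degrades gracefully instead of failing the batch.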

Captions, glossary tooltips, tap-to-translate

Three accessibility layers I built while waiting for the GPU.

Captions. Every illustration now has a one-line LLM-written caption underneath: “Kael lifts a finished lens to the late-afternoon light, the scar on her forearm catching the glow.” The caption doubles as <img alt> for screen readers. Scroll mode uses these as the primary narrative spine.

Glossary tooltips. Every recurring name, place, or coined term (Tidemouth, the Sundering, the mesh, glass-eye, the pidgin) gets a subtle dotted underline on its first appearance in each chapter. Hovering, or tapping on a phone, shows a short definition in English and Hindi. Built as a rehype plugin so it works in all three editions.
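The real implementation is a rehype plugin, so what follows is only the first-appearance logic, sketched in Python on plain text. The function name, the `gloss` class, and the placeholder definitions are assumptions; it would be called once per chapter so that "first" resets at each chapter boundary.

```python
import html
import re


def annotate_first_mentions(text: str, glossary: dict[str, str]) -> str:
    """Wrap the first occurrence of each glossary term in a <span> carrying its definition."""
    if not glossary:
        return text
    # Longest terms first so a multi-word term wins over a shorter overlapping one.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    seen = set()

    def wrap(match):
        term = match.group(1)
        if term in seen:
            return term  # later mentions stay plain
        seen.add(term)
        tip = html.escape(glossary[term], quote=True)
        return f'<span class="gloss" title="{tip}">{term}</span>'

    return pattern.sub(wrap, text)
```

This version matches case-sensitively and assumes its input contains no markup; working on the parsed tree, as a rehype plugin does, avoids accidentally rewriting text inside attributes.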

Tap-to-translate. Highlight any text and a “Translate” pill appears above the selection. Tap it and (right now) you get a placeholder; the actual backend hookup (probably a local Indic LLM running on the next GPU session) is on the list.

Multimodal validation

After all 367 frames were regenerated, a separate pass uses Gemini 2.5 Flash to grade each image against its plan entry. Four scores from 0 to 10: intent match, character identity, composition, gestalt. Plus a list of flagged issues. The script runs locally and calls the Gemini API (~$0.001/image, ~3 hours for the full batch), and produces a CSV sorted by score.

The point: instead of staring at 367 images one by one, we get an auto-flagged list of “the 20 worst” and selectively regenerate those next GPU session. The QA layer the pipeline didn’t have a week ago.
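Pulling "the 20 worst" out of that CSV takes a few lines. A sketch, assuming one row per image with the four scores in columns named as below; the column names and the unweighted mean are my assumptions, not necessarily what the validation script writes.

```python
import csv
import io

# Assumed column names for the four 0-10 scores.
SCORE_FIELDS = ("intent_match", "character_identity", "composition", "gestalt")


def worst_frames(csv_text: str, n: int = 20) -> list[dict]:
    """Return the n rows with the lowest mean score, worst first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["mean_score"] = sum(float(row[f]) for f in SCORE_FIELDS) / len(SCORE_FIELDS)
    rows.sort(key=lambda r: r["mean_score"])
    return rows[:n]
```

A plain mean treats all four scores equally; weighting character identity higher would be a reasonable variant if face drift is the failure that matters most.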

What it cost

Three hours on a B200 at $4/hr. Plus a small Gemini API tab for captions and validation. About $14 total for simplification + director pass + image v2 + emotion refs + captions.
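The arithmetic behind that figure, using only numbers quoted in this post. The API share is inferred as the remainder of the reported total, not a billed amount.

```python
GPU_HOURS = 3
GPU_RATE_USD = 4.0                 # B200 on Vast.ai, per hour
FRAMES = 367
VALIDATION_USD_PER_IMAGE = 0.001   # approximate Gemini grading cost per image
REPORTED_TOTAL_USD = 14.0          # "about $14"

gpu_cost = GPU_HOURS * GPU_RATE_USD                   # GPU rental
validation_cost = FRAMES * VALIDATION_USD_PER_IMAGE   # grading all frames once
implied_api_cost = REPORTED_TOTAL_USD - gpu_cost      # inferred, not reported

print(f"GPU rental:        ${gpu_cost:.2f}")
print(f"Validation pass:   ${validation_cost:.2f}")
print(f"Implied API share: ${implied_api_cost:.2f}")
```

GPU rental dominates: grading every frame costs well under a dollar, which is why re-running QA after each regeneration is cheap.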

Compare to: a publishing house paying a human translator to adapt 40 chapters to Plain English (weeks of work, thousands of dollars), plus an art director and 1-2 illustrators for 367 frames (months, tens of thousands). The cost-per-iteration is now low enough that the question is no longer “can we afford to redo this” but “how do we decide what to redo first.”

What’s still hard

Honest reporting:

- Hindi translation quality. Qwen-72B is okay at Hindi (~75-80%); for literary register we’d need DeepSeek-V3 on a multi-GPU box. That’s a separate session, ~$60-100 of cluster rental.
- Style LoRA training failed three times tonight on a CUBLAS error with PyTorch 2.11 + cu128 + ai-toolkit. Solvable by downgrading PyTorch or switching to kohya/sd-scripts. Deferred.
- Multi-character group scenes (the twelve Antarctikans, council chambers with eight people) still have the standard diffusion problem of face hallucination when 3+ named subjects share a frame. ControlNet pose-conditioning would help.
- The director’s taste. I had Qwen play director from a grammar bible I wrote. The output is good but recognizably “an LLM trying to be a director.” A human director would pick stronger angles for some chapters. The fix is iteration on the grammar bible plus selective human override.

What this proves

A solo author can now ship a multi-modal, multi-language, multi-edition illustrated novel with no human assistants, for about the cost of a nice dinner. The tooling is all open: the bibles are YAML, the pipelines are Python, the backend is swappable (FLUX local or hosted, Qwen or DeepSeek, Gemini for QA). The interesting questions are no longer technical, they’re editorial:

- What’s the right reading-grade target for the Easy edition?
- Should the scroll mode auto-advance, or stay manual?
- For Hindi: regional dialect choice? formal or conversational?
- For audio: one narrator or a cast of character voices?
- For video: which beats deserve motion, which stay stills?

Those are questions a publisher used to answer once per book and charge royalties for. Now they’re slider settings.

Read it: novel.too.foo. Try the edition switcher. Try the scroll mode on a phone. Toggle v1↔v2 on a chapter you like to see the director-layer difference.

The repo, the bibles, the prompts: open. The next step is the v2 plan in the repo.