#ai #image-generation #flux #novel #research

Illustrating a 40-chapter AI novel without losing the characters' faces

What it took to put 367 coherent illustrations into The Butterfly Effect - three model swaps, a failed LoRA, $20 of GPU rental, and a working pipeline.

May 16, 2026 AI

Kael at her workbench, chapter 1. Generated by FLUX.1 Kontext on an H200, anchored to a canonical reference image so she looks like this person in every chapter she appears in.

A few weeks ago I shipped The Butterfly Effect - a 40-chapter sci-fi novel that was authored through a multi-stage AI pipeline. The prose worked. Feedback came back fast: interesting, but on the long side, and please add more images.

That second part turned out to be the harder problem.

This is a writeup of the actual journey: three model swaps, a failed LoRA training run, about $20 of GPU rental, and the pipeline we landed on. The whole thing happened in one long working session with Claude and a couple of remote boxes, so this is also a snapshot of what illustrated AI fiction looks like right now, in May 2026, before any of these tools have settled.

The actual problem

Most “AI image generation” advice is about making one good image. Pick the right prompt, pick the right model, iterate.

Illustrating a novel is a different shape. You don’t need one image. You need 200-400 images of the same characters in the same world across many chapters. The protagonist on page 12 has to be the same person on page 312. Her scar has to be on the same arm. Her hair has to be the same length and texture. If she’s pale-skinned because she grew up in a sealed Antarctic habitat, she has to stay pale-skinned across every frame, even in the chapters where the prompt happens to not mention skin tone.

This is harder than it sounds. Off-the-shelf APIs handle it badly.

Attempt 1: Google Imagen 4 + style reference

The first pass used Google’s Imagen 4 via Vertex AI, with Imagen 3 Customization for style transfer (pass an existing illustration as a style anchor, get new images in the same painterly look).

The good: ~$10 to generate 367 frames. The painterly style transferred well. Continental scenes looked grounded and gritty. Antarctic interiors looked cold and sterile. Both the right vibe for the novel’s central opposition.

The bad: character identity drifted constantly. Sūrya - an Antarctic Advisory Chair who has lived her whole life in sealed habitats and is supposed to be “translucent ivory, almost ghost-white” - came out olive-skinned in chapter 2, brown-skinned in chapter 4, and tan in chapter 6.

Three Sūryas from the Imagen pass, all the “same” character:

These are not the same person. They’re three different women who happen to share the same name in the prompt.

The reason is structural, not fixable through prompting. Imagen 3 Customization locks style (brushwork, palette, atmosphere) but it does not lock subject identity. Google’s API has no way to say “this specific person, every time.” You can write “pale South Asian woman with hair pulled back” in every prompt and the model will give you a plausible-but-different woman each time.

I shipped this version as a v0 just to have something readable, but the drift was the central reason to keep going.

Attempt 2: Black Forest Labs FLUX.1 Kontext via API

Black Forest Labs (founded by researchers behind the original Stable Diffusion) shipped FLUX.1 Kontext: a model specifically designed to take both a text scene prompt and one or more reference images, and preserve the identity of the reference in the new generation.

You feed it a “canonical” image of Sūrya plus a scene description, and it puts that exact Sūrya in that scene. Independent benchmarks show cosine identity similarity above 0.92 across multiple edits - much higher than the alternatives.

I generated one canonical reference per major character (Sūrya, Kael, Moss in two states - pre and post genetic modification - plus the elderly historian Old Sekani and the Antarctic biologist Kavya), and ran chapter 17 (a Sūrya + Moss medical-bay scene) through Kontext.

The good: it worked. Sūrya was Sūrya. Moss was Moss. Scenes matched the prompts.

The bad: cost. Each Kontext call ran 3 credits base + 1.5 credits per reference image attached. With character + character + world ref, that’s 3 + 3 × 1.5 = 7.5 credits = $0.075 per image. A full 367-frame batch would have been about $27.50, which sounds modest until you realize that every prompt fix, every “ugh, regenerate that one” iteration, costs another $0.06 to $0.075, depending on how many references are attached. The pricing structure quietly punishes iteration.
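The per-image arithmetic is easy to check in a few lines of Python. This is a minimal sketch: the rate of $0.01 per credit is inferred from the 7.5 credits = $0.075 figure, and the function name is made up for illustration.

```python
# Back-of-envelope cost model for the hosted Kontext pricing described above.
# Assumption: 1 credit = $0.01, inferred from "7.5 credits = $0.075".
BASE_CREDITS = 3.0
CREDITS_PER_REFERENCE = 1.5
USD_PER_CREDIT = 0.01


def api_cost_usd(num_references: int, num_images: int = 1) -> float:
    """Dollar cost of generating `num_images` with `num_references` refs each."""
    credits = BASE_CREDITS + CREDITS_PER_REFERENCE * num_references
    return credits * USD_PER_CREDIT * num_images


# Three references: character + character + world.
per_image = api_cost_usd(num_references=3)
full_batch = api_cost_usd(num_references=3, num_images=367)
print(f"per image: ${per_image:.3f}, full batch: ${full_batch:.2f}")
```

Under the same assumptions, a frame with only two references attached works out to $0.06, so the cost of a regeneration depends on how many references that frame needs.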

I was about to commit to it. Shivam pointed out:

for $20 I can get 4 hours of an H100

He was right. The math just flipped.

Attempt 3: Local FLUX.1 Kontext on a Vast.ai H200

Pivot. I rented an NVIDIA H200 (141GB VRAM) on Vast.ai for $3.88/hour, got SSH access, and installed PyTorch, diffusers, and the open-weights FLUX.1-dev and FLUX.1-Kontext-dev models. Black Forest Labs releases open-weights dev variants of the model family behind its hosted API for local inference, gated behind a free Hugging Face license click.

About 90 minutes after starting, the same character-locking pipeline was running on hardware we owned for the night.
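Here is a rough way to see why the math flipped. The hourly rate, the per-image API price, and the 90 minutes of setup come from the text; the throughput figure is a placeholder assumption, not a measured number, so swap in your own.

```python
# Rough break-even between renting a GPU by the hour and paying per image.
GPU_USD_PER_HOUR = 3.88      # Vast.ai H200 rate quoted above
API_USD_PER_IMAGE = 0.075    # hosted Kontext with three references
SETUP_HOURS = 1.5            # "about 90 minutes" before the first image


def rental_cost_usd(num_images: int, images_per_hour: float) -> float:
    """Cost of a rental session: setup time plus generation time."""
    hours = SETUP_HOURS + num_images / images_per_hour
    return hours * GPU_USD_PER_HOUR


def break_even_images(images_per_hour: float) -> float:
    """Number of images at which rental and API cost the same."""
    # Solve GPU * (SETUP + n / rate) == API * n for n.
    per_image_gpu = GPU_USD_PER_HOUR / images_per_hour
    if per_image_gpu >= API_USD_PER_IMAGE:
        return float("inf")  # the GPU never catches up at this throughput
    return GPU_USD_PER_HOUR * SETUP_HOURS / (API_USD_PER_IMAGE - per_image_gpu)


assumed_rate = 120.0  # images per hour, hypothetical
print(f"break-even at ~{break_even_images(assumed_rate):.0f} images")
print(f"367 frames locally: ${rental_cost_usd(367, assumed_rate):.2f}")
```

The point of the model is less the exact break-even than the shape: once the box is running, a regeneration costs only seconds of rental time instead of another per-image fee.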

The canonical reference for each character:

Sūrya - pale-ivory, mesh-implant scar at left temple

Kael - short natural hair, leather tool apron

Moss (pre-mod) - Continental sailor, dark brown skin, broken nose

Moss (post-mod) - same person, after Antarctic neural and genetic editing

Each canonical was generated by the same model with a heavy prompt - no reference image. Once we had them, we plugged them in as reference inputs to all subsequent generations. Sūrya in chapter 8 now looks like Sūrya in chapter 36 because they’re both seeded from data/references/surya/canonical-01.png.
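A minimal sketch of how a generator might resolve characters to their canonical files. Only the Sūrya path is taken from the text; the directory naming for variants such as pre-mod and post-mod Moss, and both function names, are assumptions for illustration.

```python
from pathlib import Path
from typing import List, Optional, Tuple

# Root taken from the path quoted above; everything else is an assumed layout.
REFERENCE_ROOT = Path("data/references")


def canonical_path(character: str, variant: Optional[str] = None, index: int = 1) -> Path:
    """Build the path to a character's canonical reference image.

    `variant` covers characters with more than one locked look, such as
    Moss before and after modification (directory naming is hypothetical).
    """
    folder = character if variant is None else f"{character}-{variant}"
    return REFERENCE_ROOT / folder / f"canonical-{index:02d}.png"


def references_for(characters: List[Tuple[str, Optional[str]]]) -> List[Path]:
    """Resolve every character in a frame to its canonical reference."""
    return [canonical_path(name, variant) for name, variant in characters]


refs = references_for([("surya", None), ("moss", "post-mod")])
print([p.as_posix() for p in refs])
```

Keeping the lookup in one function means every chapter resolves a character to the same file, which is what makes the identity lock hold across the book.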

The thing that didn’t work: style LoRA training

The Kontext results were locked on identity but a little too clean - more 3D-render than painterly. The plan was to train a small style LoRA from the five original hand-curated illustrations of chapters 1-5 to push the model back toward that aesthetic.

LoRA training crashed three times in a row with CUBLAS_STATUS_INTERNAL_ERROR on the backward pass. The issue is a known incompatibility between PyTorch 2.11 + CUDA 12.8 (what was on the Vast image) and the gradient kernels in ai-toolkit. Forward inference works fine; backward gradient computation hits a broken cuBLAS path.

We could fix this by downgrading PyTorch to 2.5 + CUDA 12.4 or by switching to kohya/sd-scripts as the training stack. Both are real fixes. But the character coherence problem - the actual reason we needed FLUX in the first place - was already solved by Kontext alone. The style LoRA was going to be the icing.
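Before spending rental time on another training attempt, a tiny forward-and-backward pass can show whether the installed PyTorch and CUDA pair can compute gradients at all. This is a generic smoke test, not ai-toolkit's code, and whether a small matmul exercises the same cuBLAS path that crashed is an assumption; a passing result does not guarantee the full run will succeed.

```python
# Pre-flight check: run one tiny forward + backward pass before a long
# training job. Generic smoke test only; it may not hit the failing kernel.
try:
    import torch
except ImportError:  # lets the script report cleanly on a box without PyTorch
    torch = None


def backward_smoke_test() -> str:
    """Return a short status string describing whether backward() works."""
    if torch is None:
        return "skipped: torch not installed"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    try:
        x = torch.randn(256, 256, device=device, requires_grad=True)
        w = torch.randn(256, 256, device=device, requires_grad=True)
        loss = (x @ w).sum()      # matmul is dispatched to cuBLAS on GPU
        loss.backward()           # the kind of step that crashed in training
        if device == "cuda":
            torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    except RuntimeError as err:
        return f"failed on {device}: {err}"
    return f"ok on {device} (torch {torch.__version__}, CUDA {torch.version.cuda})"


print(backward_smoke_test())
```

A closer check would repeat the pass in the dtype the trainer actually uses, since a float32 matmul can succeed where a lower-precision one fails.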

We shipped the version that worked and noted the LoRA as future work. Honesty over completeness.

What the pipeline actually is

Three layers:

A visual bible (apps/novel/data/bible/) - YAML files with stable visual descriptions of every named character, every recurring location, the props, and the per-arc color palette. Pulled automatically from the manuscript.
An image plan (apps/novel/data/plan/image-plan.yaml) - 367 entries, one per planned illustration. Each entry says: which chapter, what scene (anchored to a verbatim phrase from the prose so the renderer knows where to place it), which characters are in frame, which location.
A generator - currently tools/generate-images-local.py. Reads the bible and plan, builds a prompt per image, attaches the right reference photos (character canonicals + a world reference for Continental vs Antarctic scenes), calls FLUX.1 Kontext locally on the H200, saves the PNG.
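The three layers above can be sketched as plain data plus one function. The field names are guesses based on the description, not the repo's actual YAML schema, and the bible here is an in-memory dict standing in for the parsed YAML files.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Field names are assumptions about what a plan entry holds.
@dataclass
class PlanEntry:
    chapter: int
    scene: str                  # short scene description
    anchor_phrase: str          # verbatim phrase from the prose
    characters: List[str] = field(default_factory=list)
    location: str = ""


def build_prompt(entry: PlanEntry, bible: Dict[str, Dict[str, str]]) -> str:
    """Combine the scene with stable descriptions pulled from the bible."""
    parts = [entry.scene]
    for name in entry.characters:
        description = bible["characters"].get(name)
        if description:
            parts.append(f"{name}: {description}")
    location = bible["locations"].get(entry.location)
    if location:
        parts.append(f"Setting: {location}")
    return ". ".join(parts)


# Tiny stand-in for the YAML bible, using descriptions from this post.
bible = {
    "characters": {"Sūrya": "pale-ivory skin, mesh-implant scar at left temple"},
    "locations": {"observation-chamber": "Antarctic interior, cold and sterile"},
}
entry = PlanEntry(
    chapter=8,
    scene="Sūrya alone in the Antarctic observation chamber",
    anchor_phrase="(verbatim phrase from chapter 8)",
    characters=["Sūrya"],
    location="observation-chamber",
)
print(build_prompt(entry, bible))
```

Because the character description is injected from the bible on every call, a trait like skin tone reaches the prompt even when the scene text never mentions it.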

The generator is backend-swappable - yesterday it called BFL’s API, today it runs FLUX locally, tomorrow it could call something better. Same bible, same plan, different backend. That’s the real product of this work: not the images themselves, but the data layer that survives the next three model generations.
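One way to get that swappability is a small interface that every backend implements. The sketch below is hypothetical: the class names, the registry, and the Kael reference path are illustrative, and the only backend shown is a dry-run stand-in that returns text rather than a real PNG.

```python
from typing import Callable, Dict, List, Protocol


class ImageBackend(Protocol):
    """Anything that turns a prompt plus reference images into image bytes."""

    def generate(self, prompt: str, reference_paths: List[str]) -> bytes: ...


class EchoBackend:
    """Stand-in backend for dry runs; returns a text payload, not a real PNG."""

    def generate(self, prompt: str, reference_paths: List[str]) -> bytes:
        return f"{prompt} | refs={len(reference_paths)}".encode("utf-8")


# Registry keyed by name. Real entries (hosted API, local FLUX) would be
# registered here; the names are illustrative, not the repo's identifiers.
BACKENDS: Dict[str, Callable[[], ImageBackend]] = {"echo": EchoBackend}


def render(backend_name: str, prompt: str, reference_paths: List[str]) -> bytes:
    """Render one frame with whichever backend is selected by name."""
    backend = BACKENDS[backend_name]()
    return backend.generate(prompt, reference_paths)


# The reference path below is an assumed location, used only for the demo.
payload = render("echo", "Kael at the workbench", ["data/references/kael/canonical-01.png"])
print(payload.decode("utf-8"))
```

With this shape, a model swap means adding one class and one registry entry, while the bible and plan stay untouched.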

What v1 actually looks like

Four heroes from the FLUX local batch, one per arc:

Ch 1 - Kael at the workbench, examining a finished lens

Ch 8 - Sūrya alone in the Antarctic observation chamber

Ch 22 - The modification chamber, Moss on the diagnostic chair

Ch 36 - First contact on the dark-sand beach

What’s missing in v1

Identity is locked. That’s real. Sūrya looks like Sūrya in every frame. Kael’s scar is on her left forearm in every chapter she appears in. Moss’s amber-streaked eyes are there post-modification and absent before it.

But Shivam’s first reaction after scrolling was the right one: these read as fillers, not narrative. Look at the four frames above. Each character has the same composed expression in every shot. The composition is uniform - mid-shot, character centered, neutral pose. The images show you who and where but not what the scene means.

The cause is structural. FLUX Kontext weights the reference image heavily. Our reference is a neutral composed portrait. So every generation inherits that neutrality, even when the chapter calls for grief, fear, wonder, exhaustion. The text prompt asks for “Sūrya weeping” and Kontext renders “Sūrya, neutral, in a room that suggests weeping happened.”

Solvable. Not solved in v1.

Where this is going

The Butterfly Effect is the first novel. The pipeline is now research infrastructure for the next thing. The roadmap from here is roughly:

Per-emotion canonical refs - not one neutral Sūrya, but Sūrya-neutral, Sūrya-grief, Sūrya-wonder, Sūrya-resolve. Same character, different emotional anchor. The plan picks the right ref for the scene.
Cinematic prompt rewriting - the current scene descriptions are written like “Kael at the gate with Sekani.” A directorial pass turns those into “low angle, golden hour, Kael’s face in shadow, Sekani’s smile catching the last light.” Sounds small, isn’t.
Shot-scale variety - mostly mid-shots today. Mix in extreme close-ups on a hand, dutch tilts, over-the-shoulder shots, wide landscapes. The visual rhythm of a real comic / film.
Multimodal-LLM auto-validation - feed every generated image back through Gemini or Claude with the original plan entry: “does this image show what I asked for?” Regenerate the failures.
Style LoRA - finally land the training step (different PyTorch version next time) so the painterly aesthetic stops fighting Kontext’s clean-realism prior.
Captions - one-line LLM-written caption under each image saying what the reader should notice. Bridges the image-to-narrative gap explicitly.
More characters per book, real character LoRAs, scene composition with multiple named subjects in the same frame (still hard for current models), automatic image-plan generation from prose, drift-flagging that triggers regeneration.
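Two of the roadmap items, per-emotion references and auto-validation, can be sketched without committing to any model. Everything here is hypothetical: the reference layout, the function names, and the fake generator and judge used in the dry run. A real version would plug in the image backend and a multimodal model call in their place.

```python
from typing import Callable, Dict, Optional


# Assumed layout: one reference per (character, emotion), with "neutral"
# as the fallback when no emotional anchor exists yet.
def pick_reference(refs: Dict[str, Dict[str, str]], character: str, emotion: str) -> str:
    """Choose the emotional anchor for a scene, falling back to neutral."""
    by_emotion = refs[character]
    return by_emotion.get(emotion, by_emotion["neutral"])


def generate_until_valid(
    generate: Callable[[int], bytes],
    is_valid: Callable[[bytes], bool],
    max_attempts: int = 3,
) -> Optional[bytes]:
    """Regenerate with a new seed until the validator accepts the image.

    `generate` and `is_valid` are injected so the image backend and the
    multimodal judge stay swappable. Returns None if every attempt is
    rejected, so the frame can be flagged for human review.
    """
    for seed in range(max_attempts):
        image = generate(seed)
        if is_valid(image):
            return image
    return None


# Dry run with fakes: the "judge" only accepts the image made with seed 2.
refs = {"surya": {"neutral": "surya-neutral.png", "grief": "surya-grief.png"}}
chosen = pick_reference(refs, "surya", "wonder")  # no wonder ref yet
result = generate_until_valid(
    lambda seed: f"img-{seed}".encode("utf-8"),
    lambda image: image == b"img-2",
)
print(chosen, result)
```

Capping the attempts matters on rented hardware: a frame the judge never accepts should end up in a review queue rather than burning GPU hours.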

The honest version of “AI-generated illustrated novel” right now is: 85-95% identity stability with the best tools, real production cost under $25, and a meaningful amount of human-in-the-loop curation. Not push-button. But not science fiction either.

Go read it: novel.too.foo.

Notes

Process notes, decision logs, and the actual YAML bible+plan live in the repo. The whole pipeline is set up so a future model swap is roughly a one-file change, not a rewrite. If you’re working on something adjacent, ping me.