#openclaw #ai-agents #discord #local-llm #ollama #qwen #kimi #cost-optimization

Krishna on a Budget: Cutting an AI Agent's Bill from $100 to $15

Krishna is my Discord agent. He was costing $100/month on a flat-traffic schedule. Pulling the JSONLs found a 9-day-old session that never rotated. Here is the structural fix: session rotation, capped compaction, a smart-routing proxy, and a 4070 SUPER finally doing some work.

Krishna is the Discord agent on the OpenClaw runtime sitting in front of my workshop. He answers questions, holds short conversations, runs little health checks, and occasionally pulls a fact off the web. He is not, by any normal definition, a heavy user. So when the Moonshot bill landed at roughly $100 for the month, I knew something was off.

This post walks through finding the leak, fixing it, and lining up enough guardrails that the next runaway is visible within a day instead of a month. The interesting part is not any single trick. It is how the same money was being wasted in three layers at once, and how each layer needed a different style of fix.

What Krishna actually does

The whole point of giving Krishna a Discord identity was to have a low-friction way to talk to the workshop. Heartbeats every thirty minutes, the occasional “did the deploy finish” question, a handful of routine asks per day. Light traffic by any reasonable measure.

Under the hood, OpenClaw routes each Discord channel to a sessionKey, and each sessionKey keeps a long-running JSONL transcript. Models, skills, tools, memory, all wired up. The default model was Kimi K2.6 via Moonshot’s API, pay-per-token. Cheap per call. Predictable, I assumed.

The bill, and the suspicious distribution

The session JSONLs were the right place to look. One quick walk of ~/.openclaw/agents/krishna/sessions/*.jsonl for the last week showed something weird:

day        sessions   biggest file
2026-05-09     2      2.9 MB / 1120 messages
2026-05-08     1      175 KB
2026-05-07     1      399 KB
2026-05-05     1      239 KB
2026-05-04     1      426 KB
earlier        rest   50-300 KB typical

A 2.9 MB session on one day. With 1120 messages. The other days look about right for the traffic I would expect.

I pulled the timestamps on that file and got the answer: the session started on April 30 and was still appending nine days later. Same sessionId, same accumulating transcript, every new Discord message getting handed the entire prior history as context. The fancy term is “quadratic context”, and it is exactly what it sounds like. Each new turn pays for itself plus every turn before it.

Running the math on this one transcript, at Moonshot’s rates of $0.60 per million input tokens, that single session shipped roughly fifty-five million input tokens to Kimi over the nine days. About thirty-three dollars of input billing, from one Discord channel that I sometimes ignored for a day at a time.

The “$100 per month” turned out to be a fairly clean N×N curve, multiplied across two or three persistent sessions. Sustained drift, not a runaway loop.

Why this happens at all

The OpenClaw default is “one session per Discord channel, keep appending forever”. That works fine for a focused short-lived chat. It is awful for a long-lived agent on a long-lived channel. The compaction system is supposed to summarize old turns before they crush the context window, but the trigger fires near the model’s hard limit, which on Kimi is 200,000 tokens. By the time it kicks in, every call up to that point has already been paid for at full size.

I had been thinking about cost as a per-call problem. The truth was the cost was a per-session-age problem, and the per-call cost was just the visible symptom.

Layer one: rotate sessions on a schedule

OpenClaw shipped native session-rotation knobs in the 2026.5.7 release I was already overdue to install. Bumping the package, the schema gained these fields at the root:

"session": {
  "idleMinutes": 240,
  "resetByType": {
    "direct": { "mode": "idle", "idleMinutes": 240 },
    "group":  { "mode": "daily", "atHour": 11 },
    "thread": { "mode": "daily", "atHour": 11 }
  },
  "resetByChannel": {
    "1475012600340418600": { "mode": "daily", "atHour": 11 }
  },
  "maintenance": {
    "mode": "enforce",
    "pruneAfter": "7d",
    "maxEntries": 50
  }
}

This is the structural fix. Even if everything else fails, a session cannot grow past four hours of idle time, or past one day of activity. Worst-case damage is bounded.

The old transcripts got archived to ~/.openclaw/sessions-archive/ in case I ever wanted to mine them. The active store starts fresh on the next message.

Layer two: cap compaction before it triggers

Compaction was set to mode: safeguard and otherwise default. That meant it triggered near the 200K hard limit. By that point the conversation had already been paid for.

The right move was to compact much, much earlier:

"compaction": {
  "mode": "safeguard",
  "maxHistoryShare": 0.08,
  "keepRecentTokens": 6000,
  "recentTurnsPreserve": 4,
  "provider": "ollama/qwen3:14b",
  "qualityGuard": { "enabled": true, "maxRetries": 2 }
}

maxHistoryShare: 0.08 means conversation history can never be more than 8% of the model’s context window. On Kimi’s 200K window that caps history at roughly 16K tokens. Plenty for a Discord agent to feel coherent, way under what causes runaway billing.

provider: ollama/qwen3:14b is the genuinely interesting line. Compaction is a long-input, short-output job. Summarize forty turns of chat into a paragraph. The output is small. The model does not need to be brilliant. It needs to read carefully and condense. That is exactly the kind of work that should not run on a frontier paid model. It belongs on the GPU I already own.

Layer three: use the 4070 that was sitting at zero percent

Speaking of the GPU. The workstation has an RTX 4070 SUPER, 12 GB of VRAM, sitting at 0% utilization for months. There was an Ollama install on it from a previous experiment, with four different models pulled, and absolutely nothing routing through it.

OpenClaw 2026.5.7’s heartbeat schema also gained a model override:

"heartbeat": {
  "every": "90m",
  "model": "ollama/qwen3:14b",
  "lightContext": true,
  "isolatedSession": true,
  "ackMaxChars": 200,
  "skipWhenBusy": true
}

Heartbeats now hit local Ollama, with light context (no dragging in the full conversation), an isolated session (no transcript pollution), and a short response cap. Free. The 4070 keeps the model warm between calls, so latency is the eval rate of a single short response, around two seconds.

For Ollama itself, four environment variables in a systemd override turned out to be the difference between “loaded but useless” and “actually working”:

[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"

The non-obvious one is OLLAMA_NUM_PARALLEL=1. The Ollama default is to allocate KV cache for four concurrent inference slots. For a single-user agent that is wasteful. With it set to one slot, plus Q8 quantized KV cache and flash attention, qwen3:14b at 16K context fits 100% on the GPU with about a gigabyte of headroom.

The single warm model now serves heartbeats, compaction, intent classification, and Kimi-fallback. Same model, same hot weights, just paged differently per request.

Layer four: a smart-routing proxy

The last piece was the most fun to build. Between OpenClaw and Kimi I dropped a tiny Node service on 127.0.0.1:18790. From OpenClaw’s perspective it is just another OpenAI-compatible endpoint, so models.providers.moonshot.baseUrl points at it. The proxy has two jobs.

First, it walks the message list before forwarding. Any message with role tool and content longer than 20,000 characters gets handed to local qwen3:14b with a “condense this preserving facts, names, numbers, and URLs” prompt. The condensed version replaces the original in the request that goes to Kimi. Web-search dumps and browser-scrape outputs are the typical offenders. A 5KB summary replaces a 50KB result, and Kimi never sees the bloat.

Second, it peeks at the most recent user message. If there are no tools requested and the conversation isn’t mid-tool-loop, it runs a tiny classification prompt against local qwen3:14b: chitchat or task. Chitchat means greeting, thanks, emoji-only, or social acknowledgement. Those get answered locally. Tasks get forwarded to Kimi unchanged. The classifier is conservative on purpose. False negatives just mean a Kimi call we did not have to make. False positives mean a worse reply, so the bias is toward “forward to Kimi when uncertain.”

The proxy reports stats at /stats:

calls, chitchat, condensedMsgs, condensedCharsSaved, kimiForwards, errors

The condensed-chars-saved counter directly maps to Kimi tokens not paid for.

The CPU oven incident

While building the GPU offload story, I hit a different problem that is worth a note. After everything was wired up, the workstation got noticeably hot. Hands on the desk hot. The fan was loud. I checked GPU temperature first. Fifty-eight degrees. Normal.

The CPU told a different story. Load average over seven, sustained. The top process was ollama runner at 690% CPU. Which is to say, seven cores pinned solid.

The cause was the compaction provider. A single compaction call had gone in at 40,000 tokens of input. Ollama auto-expanded the KV cache to fit. With keep_alive: -1 (forever), the bloated model state stuck around. Twenty-eight percent of the model layers no longer fit on the GPU at that context size, so they were spilling to CPU. Every subsequent request, even small heartbeats, was paying that 28%-on-CPU tax. The runner had been pinned that way for nineteen hours by the time I noticed.

The fix was the maxHistoryShare: 0.08 cap above. 16K tokens of compaction input fits comfortably on the GPU. No spill, no CPU oven. The structural cap that saved money also saved my dining-room temperature.

Layer five: a cost monitor so this can’t be silent again

The last piece is a cron job that walks the session JSONLs every night at 4:05 in the morning, estimates token volume per session, multiplies by the Kimi rate, writes a markdown report to ~/.openclaw/cost-report/<date>.md, and posts a one-liner to the Discord channel if the day’s estimate is over five dollars. Top three sessions by spend are flagged so the cause is obvious without digging.

If the system ever drifts back into a 9-day-session scenario, I see it on day 2.

Result

Day-one cost estimate after all of this: 71 cents. The first session of the new structural-cap era ran 60 assistant turns and used about a million input tokens total. Annualized at this rate the bill is around $20 to $25, with two-thirds of the headroom unused if a noisy day shows up.

Where the savings actually came from, in rough proportion:

  1. Session rotation. By far the biggest. The N×N curve is the one that hides the cost.
  2. Lower compaction trigger. Cuts effective context per call by 60% versus the default.
  3. Heartbeats off the paid model. Roughly $25 to $50 of monthly noise, just gone.
  4. Tool-output condensing. Smaller, more variable, but real on tool-heavy days.
  5. Chitchat routing. Smallest line item. Hard to overcount because the classifier is conservative.

The 4070 went from 0% to roughly 89% GPU on routine traffic, with a single warm qwen3:14b at 8K context. The eval rate is around 32 tokens per second, which for the kind of work it’s doing is essentially instant.

What I would do differently

In hindsight the right order would have been: write the cost monitor first, then look at the data, then build the fixes. I did it the other way, fixing what I assumed would be wrong before measuring, and got lucky that the obvious thing was also the right thing. The monitor is the cheapest piece by far and it would have caught this in week two.

The other thing worth saying out loud: local LLMs aren’t replacements for paid frontier models, they’re coprocessors. Kimi still handles every substantive reply. qwen3:14b on the 4070 handles the bulk-input, cheap-output work that Kimi was overpaying for. Same architecture as any well-designed pipeline. The interesting part of putting it together was not the models. It was noticing that Kimi was doing thousands of dollars per year of work that should never have been hers.