Not every Claude Code turn deserves Claude

Kurt Overmier & AEGIS

I've been running AEGIS — my personal AI kernel — autonomously for months. Overnight sprints, cron-triggered self-improvement, agentic pipelines that spawn Claude Code sessions to work through issue queues.

At some point I looked at the usage data and noticed something obvious in retrospect: a huge fraction of turns were doing things like "summarize this 400-line file", "plan the next step", "draft this boilerplate function". Things that don't need frontier reasoning. Things that cost the same as the turns where Claude is actually working hard across a tool loop.

So I built a router. It's called bildy, and I'm putting it on GitHub today.


What it does

bildy is a local proxy that sits between Claude Code (or Codex) and the Anthropic API. It intercepts each request, classifies the cognitive load, and routes accordingly:

Route class    What it catches                          Default routing
tool_loop      Claude calling tools, reading results    Anthropic (keep it)
long_context   Large inputs, cross-file reasoning       NVIDIA NIM, Groq
planning       "what should I do next?" turns           Workers AI, Groq, Cerebras
code_draft     Boilerplate, repetitive generation       Workers AI, Groq, Cerebras
summary        "explain this", "summarize that"         Workers AI, Cerebras, Groq

Classification happens locally — no extra API call. The routing config is a JSON file you control.
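To make "classified locally" concrete, here is a minimal sketch of a structural check, assuming the proxy sees Anthropic Messages API request bodies. This is not bildy's actual classifier: the function name classify_structure and the 60,000-character threshold are hypothetical, picked only for illustration. It covers just the two classes that can be spotted from request shape alone.

```python
from typing import Any

# Hypothetical threshold; the real cutoff bildy uses is not stated in this post.
LONG_CONTEXT_CHARS = 60_000


def _blocks(message: dict[str, Any]) -> list[dict[str, Any]]:
    """Normalize message content (string or block list) to a list of blocks."""
    content = message.get("content", "")
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return list(content)


def classify_structure(body: dict[str, Any]) -> str:
    """Return 'tool_loop', 'long_context', or 'unclassified' for a request body."""
    messages = body.get("messages", [])
    # A tool loop shows up as tool_use / tool_result blocks in the latest messages.
    for message in messages[-2:]:
        if any(b.get("type") in ("tool_use", "tool_result") for b in _blocks(message)):
            return "tool_loop"
    # Otherwise, measure how much plain text the request carries.
    total_chars = sum(len(b.get("text", "")) for m in messages for b in _blocks(m))
    if total_chars > LONG_CONTEXT_CHARS:
        return "long_context"
    return "unclassified"
```

Everything here runs on the request body in memory, which is the point: no network call is needed to decide where a turn goes.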


Shadow mode first

The part I built most carefully: shadow mode is on by default.

When you first run bildy, everything still routes to Anthropic. The gateway just logs what it would have done. After running it against my own AEGIS sessions, /shadow/stats returned this:

{
  "shadowMode": true,
  "totalRequests": 519,
  "shadowedRequests": 40,
  "totalProjectedSavingsUsd": 0.207,
  "byRoute": {
    "summary":       { "count": 13, "projectedSavingsUsd": 0.149 },
    "planning":      { "count": 9,  "projectedSavingsUsd": 0.045 },
    "fallback_safe": { "count": 10, "projectedSavingsUsd": 0     },
    "tool_loop":     { "count": 4,  "projectedSavingsUsd": 0     },
    "long_context":  { "count": 3,  "projectedSavingsUsd": 0     },
    "code_draft":    { "count": 1,  "projectedSavingsUsd": 0.012 }
  }
}

40 of 519 turns were shadowed and classified, and 23 of those (summary, planning, and code_draft) carried projected savings. The summary class drove most of it: $0.149 across 13 turns, or roughly $0.011 saved per summary turn routed to Cerebras instead of Anthropic. The tool_loop turns correctly stayed on Anthropic at $0 savings, which is exactly what you want: Claude for the hard parts, cheap inference for everything else.
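The per-route numbers are easy to check by hand. A short sketch that divides each route's projected savings by its turn count, using only the figures from the stats output above (the helper name per_turn_savings is mine, not part of bildy):

```python
# Figures copied from the /shadow/stats output shown above.
stats = {
    "totalRequests": 519,
    "shadowedRequests": 40,
    "byRoute": {
        "summary": {"count": 13, "projectedSavingsUsd": 0.149},
        "planning": {"count": 9, "projectedSavingsUsd": 0.045},
        "fallback_safe": {"count": 10, "projectedSavingsUsd": 0},
        "tool_loop": {"count": 4, "projectedSavingsUsd": 0},
        "long_context": {"count": 3, "projectedSavingsUsd": 0},
        "code_draft": {"count": 1, "projectedSavingsUsd": 0.012},
    },
}


def per_turn_savings(by_route: dict) -> dict:
    """Average projected savings per turn for each route with at least one turn."""
    return {
        route: entry["projectedSavingsUsd"] / entry["count"]
        for route, entry in by_route.items()
        if entry["count"]
    }


per_turn = per_turn_savings(stats["byRoute"])
shadowed_share = stats["shadowedRequests"] / stats["totalRequests"]
with_savings = sum(
    entry["count"]
    for entry in stats["byRoute"].values()
    if entry["projectedSavingsUsd"] > 0
)

for route, usd in sorted(per_turn.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{route:<14} ${usd:.4f} per turn")
print(f"shadowed share: {shadowed_share:.1%}, turns with savings: {with_savings}")
```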

You decide when to trust the numbers. Go live on summary first, watch for a week, then add planning. The config supports enabling one class at a time:

{
  "routing": {
    "shadowMode": true,
    "shadowRoutes": { "summary": false }
  }
}
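One way to read that config, as a sketch: I'm assuming here that a false entry under shadowRoutes takes that class live while shadowMode covers every class not listed. That reading is inferred from the example, not confirmed against the gateway source, and route_mode is a hypothetical helper.

```python
def route_mode(config: dict, route: str) -> str:
    """Return 'live' or 'shadow' for a route class.

    Assumed semantics: a per-class entry in shadowRoutes overrides the
    global shadowMode flag; classes without an entry follow shadowMode.
    """
    routing = config.get("routing", {})
    default_shadow = routing.get("shadowMode", True)
    shadowed = routing.get("shadowRoutes", {}).get(route, default_shadow)
    return "shadow" if shadowed else "live"


config = {"routing": {"shadowMode": True, "shadowRoutes": {"summary": False}}}
print(route_mode(config, "summary"), route_mode(config, "planning"))
```

Under that assumption, summary turns are actually rerouted while planning and the rest are still only logged.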

CF Workers AI as the default cheap provider

When you check the routing catalog, Workers AI appears as the recommended provider for every cheap route class:

{
  "summary":       { "recommended": { "provider": "cloudflare", "model": "@cf/zai-org/glm-4.7-flash" } },
  "planning":      { "recommended": { "provider": "cloudflare", "model": "@cf/openai/gpt-oss-120b" } },
  "code_draft":    { "recommended": { "provider": "cloudflare", "model": "@cf/openai/gpt-oss-120b" } },
  "fallback_safe": { "recommended": { "provider": "cloudflare", "model": "@cf/openai/gpt-oss-120b" } }
}

Workers AI is free up to Cloudflare's daily free-tier allocation. If you already have a Cloudflare account, setting CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN is all it takes. Add AI_GATEWAY_ID and those turns also flow through CF's AI Gateway for cache and observability.

Groq and Cerebras are in the fallback chain for each class. NVIDIA NIM handles long_context as primary.


What the pattern looks like

On long AEGIS sprints (multi-hour sessions fixing grounding logic, schema migrations, self-improvement runs) the pattern is consistent: a significant fraction of turns open with requests like:

  • "Read this file and tell me what the grounding logic does"
  • "Draft a migration for this schema change"
  • "What's the order of operations for this fix?"

These are summary, code_draft, and planning turns. They don't need Claude's reasoning depth. The turns that need Claude are the actual tool loops — reading a test failure, deciding what to change across multiple files simultaneously, writing the fix. Those stay on Anthropic. The shadow stats show exactly which class each turn fell into.
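For a feel of how prompts like the three above could be sorted without a model call, here is a rough keyword sketch. The patterns, their order, and the use of fallback_safe as the default are all assumptions of mine for illustration; bildy's real rules may look nothing like this.

```python
import re

# Hypothetical keyword patterns, checked in order; the first match wins.
INTENT_PATTERNS = [
    ("summary", re.compile(r"\b(summari[sz]e|explain|tell me what|read this)\b", re.I)),
    ("code_draft", re.compile(r"\b(draft|boilerplate|scaffold|stub)\b", re.I)),
    ("planning", re.compile(r"\b(plan|next step|order of operations|what should i do)\b", re.I)),
]


def classify_intent(prompt: str) -> str:
    """Map a prompt to a route class, defaulting to 'fallback_safe'."""
    for route, pattern in INTENT_PATTERNS:
        if pattern.search(prompt):
            return route
    return "fallback_safe"


for prompt in (
    "Read this file and tell me what the grounding logic does",
    "Draft a migration for this schema change",
    "What's the order of operations for this fix?",
):
    print(classify_intent(prompt), "<-", prompt)
```

A heuristic this crude will misfire, which is the argument for running in shadow mode before trusting any of it.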


Route classes as the extension model

The config surface is route classes, not individual providers. You map classes to ordered provider lists — first compatible provider wins:

{
  "routing": {
    "routes": {
      "planning":   ["groq", "cerebras"],
      "summary":    ["cerebras", "groq"],
      "code_draft": ["cloudflare", "groq"]
    }
  }
}

Provider down? Next in list. The gateway picks the right model from the provider's catalog for the turn type — you don't hardcode model names.
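The "first compatible provider wins" rule is simple enough to sketch in a few lines. The function name pick_provider, the is_available callback, and the choice to fall back to anthropic when nothing in the list is usable are assumptions for illustration, not the gateway's actual code.

```python
from typing import Callable


def pick_provider(
    routes: dict[str, list[str]],
    route: str,
    is_available: Callable[[str], bool],
    default: str = "anthropic",
) -> str:
    """Walk the ordered provider list for a route class; first usable one wins."""
    for provider in routes.get(route, []):
        if is_available(provider):
            return provider
    # Assumed behavior: unmapped classes and exhausted lists stay on the default.
    return default


routes = {
    "planning": ["groq", "cerebras"],
    "summary": ["cerebras", "groq"],
    "code_draft": ["cloudflare", "groq"],
}

# Simulate groq being down: planning falls through to cerebras.
print(pick_provider(routes, "planning", lambda provider: provider != "groq"))
```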

Supported providers: cloudflare, groq, cerebras, anthropic, openai, nvidia. New provider integrations aren't accepted in PRs — use route classes to remap to an existing provider in your local config.


Install (5 minutes, Workers AI or Groq free tier)

# Clone and install
git clone https://github.com/Stackbilt-dev/bildy.git
cd bildy && npm install

# Add at least one provider key (Workers AI is free on CF free tier)
export CLOUDFLARE_ACCOUNT_ID=your-account-id
export CLOUDFLARE_API_TOKEN=your-api-token
# OR
export GROQ_API_KEY=gsk_...

# Also set your Anthropic key (shadow mode still routes here)
export ANTHROPIC_API_KEY=sk-ant-...

# Start the gateway first, so it inherits the real Anthropic key
npm run start &

# Point Claude Code at the gateway. These are set for the claude process only,
# so they don't overwrite the real key exported above.
ANTHROPIC_BASE_URL=http://localhost:8787 ANTHROPIC_API_KEY=local-dev-key claude

Shadow mode is on. Run a session. Check /shadow/stats. The numbers you see are yours — not a projection from my workflow.
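If you'd rather read the stats as a table than as raw JSON, a small sketch along these lines should do it. It assumes /shadow/stats answers a plain unauthenticated GET on the gateway port used above and returns the same shape as the sample earlier in this post; both function names are mine.

```python
import json
from urllib.request import urlopen


def fetch_shadow_stats(base_url: str = "http://localhost:8787") -> dict:
    """GET /shadow/stats from a locally running gateway (assumed: no auth)."""
    with urlopen(f"{base_url}/shadow/stats", timeout=5) as response:
        return json.load(response)


def format_report(stats: dict) -> str:
    """Render routes sorted by projected savings, largest first."""
    lines = [f"{stats['shadowedRequests']} of {stats['totalRequests']} requests shadowed"]
    by_route = sorted(
        stats["byRoute"].items(),
        key=lambda kv: kv[1]["projectedSavingsUsd"],
        reverse=True,
    )
    for route, entry in by_route:
        lines.append(
            f"{route:<14} {entry['count']:>4} turns  ${entry['projectedSavingsUsd']:.3f}"
        )
    return "\n".join(lines)
```

With the gateway running, print(format_report(fetch_shadow_stats())) prints one line per route class.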

Interactive setup wizard also available (bildy init), but the env-variable path is more predictable. When something breaks, the first thing to check is obvious.


What this is and isn't

Solodev workflow tool built for my own use. macOS and Linux only. If it breaks your shell, you keep both pieces.

Not a hosted service. Not SaaS. Local proxy — you run it, you own the keys. The gateway sees request classifications, not your code.

The full platform version of this thinking — multi-model routing with cost controls, quota management, and team-level governance at the Cloudflare edge — is what we're building at Stackbilder. bildy is the local tooling arm of the same architecture, and it's how I've been developing and stress-testing the routing model.


GitHub: Stackbilt-dev/bildy. Shadow mode on by default, six supported providers, MIT license. Workers AI free tier recommended.

Written by Kurt Overmier & AEGIS. Published on The Roundtable.