LiteLLM is the default way to proxy LLM calls, and for good reason. It normalizes OpenAI and Anthropic schemas cleanly, and for a chatbot or a single-threaded script it's exactly the right tool.
Then you scale into a multi-agent setup, and the thing that made it great — being a thin, stateless proxy — turns into the thing bleeding your budget. We hit that wall ourselves. This is what we found on the other side of it.
The Stateless Proxy Problem
Standard load balancing was designed for stateless web traffic. Agents are the opposite of stateless.
An agent generates enormous context windows. It loops over codebase files, search results, and chat history, dragging the whole accumulated context along on every turn. Run 50 of them at once and your provider's RPM and TPM ceilings arrive almost immediately.
A stateless gateway like LiteLLM answers that by spreading load across multiple API keys:
This kills your 429s. It also creates a much larger cost you never see on the rate-limit dashboard.
And it's not a config you can tune away. LiteLLM picks an upstream key per request, with no memory of which key served this agent's previous turn — there's no session state to pin to. Sticky, cache-aware routing isn't a missing flag; it's outside what a stateless proxy is architecturally able to do.
The Cost of Context Thrashing
Because routing is round-robin (or random), Agent A might land on key_1 for turn 1 and key_2 for turn 2.
Every time an agent jumps to a different upstream, the provider-side prompt cache for that session is abandoned — the cache is scoped to the key that wrote it. So you pay full freight to re-ingest a 100k-token prompt that was already cached two seconds ago on the other key. We call this context thrashing — an agent's warm cache getting thrown away mid-session because consecutive turns land on different upstream keys. See The Hidden Cost of Claude's Prompt Caching for the cache-write math that makes this so expensive.
You traded a 429 for a 4x-plus inflation of your token bill. Quietly.
| Gateway architecture | API key utilization | Cache hit rate | Cost per 1k turns |
|---|---|---|---|
| Stateless (round-robin) | 99% (evenly spread) | < 5% | $350 |
| Context-aware (stateful) | 85% (pinned by agent) | > 85% | $82 |
Benchmark: 50 concurrent agent sessions, each replaying a ~100k-token context window over 1,000 total turns against Sonnet 4.6. Cost reflects cache-read vs full re-ingest at standard rates.
Even key utilization looks great on a Grafana panel. It's also the exact thing destroying your cache. The metric you're optimizing is fighting the metric you're paying for.
Context-Aware Routing: The Synrouter Approach
A gateway built for agents has to know something the agent knows: which conversation this request belongs to.
Instead of routing by target model, route by session fingerprint:
- Identify the agent instance and its shared context.
- Hash that context.
- Pin the hash to the specific upstream connection already holding the warm cache.
That means inspecting the payload before you route it — not just reading the model field and moving on. This is the core engineering pivot behind Synrouter: it inspects the tool_use_id and dialogue history so a given agent's thought loop is mathematically pinned to the node with its active cache. Same context, same connection, every turn.
Diagnose the Damage First
Not sure context thrashing is what's happening to you? Don't rewrite your infra on a hunch.
Measure it. We built the MIT-licensed Agentgauge for exactly this. Point it at your proxy endpoint and it diagnoses your cache hit rate across concurrent simulated agent runs:
An 88% direct rate collapsing to 4% behind your own proxy is not a tuning problem. It's the round-robin logic throwing away the cache on every hop.
We built Synrouter because our own proxy was doing this to us, and no amount of LiteLLM config fixed it — the statelessness was the point of the tool, and the problem for our workload. If your audit comes back looking like the one above, that's the signal it's time to move routing below the application layer.
Read next: Codex vs Claude Code: Why 'Pick One' Is the Wrong Question — the multi-model routing strategy that amplifies these savings.
Read next: The 5-Minute TTL: How Anthropic's Prompt Cache Quietly Broke Long-Running Agents