You're three hours into a Claude Code session. The agent has built up a 60K-token context — system prompt, tool definitions, conversation history, file contents. Cache hits are humming along at $0.30/M instead of $3/M. Then a 429 lands.
rate_limit_error: Request was throttled. Expected available in 1.2s.
The retry kicks in. Another 429. And another. Your agent stalls, the session stalls, and the cache TTL clock keeps ticking. By the time the rate limit window resets, your 5-minute cache has expired. Every subsequent request pays full input price again.
We've watched this exact pattern burn $30+ in a single afternoon for a team of three developers sharing one Anthropic API key.
The Rate Limit Wall
Anthropic enforces two limits per organization: requests per minute (RPM) and tokens per minute (TPM). They're per-org, not per-key — adding a second API key to the same org doesn't double your throughput.
| Tier | RPM | Input TPM | Daily Output Cap |
|---|---|---|---|
| Build | 5 | 20,000 | 300,000 |
| Tier 1 | 50 | 500,000 | 2,000,000 |
| Tier 2 | 1,000 | 80,000 | 10,000,000 |
Sources: Anthropic rate limits docs, TokenCalculator analysis (April 2026).
Tier 1 input TPM jumped from 30K to 500K after the Colossus 1 compute deal — a massive improvement. But consider what a single Claude Code request looks like: 20-30K tokens of system prompt and tool definitions, plus whatever context the agent has accumulated. One long session can burn through 500K TPM in under 30 requests.
Add a second developer, a parallel subagent, or a CI pipeline hitting the same org? You're back to 429s.
The Naive Fix: Round-Robin Across Keys
The obvious workaround: get API keys from multiple Anthropic organizations (separate accounts, separate billing). Distribute requests across them. Problem solved?
Not quite. Here's what happens with naive round-robin:
You're paying double the cache write costs, and your effective cache hit rate is 50% of what it should be — each key only sees half the traffic. With three keys, it drops to 33%. With five, 20%.
We wrote about this in The Claude Cache TTL Trap — naive multi-key setups can actually increase your token spend if you're not careful about cache locality.
The Right Fix: Cache-Aware Routing
What you need is session-sticky routing: requests from the same Claude Code session go to the same API key, preserving cache hits. Only when that key hits a rate limit should you failover to a different one.
LiteLLM supports this via the PromptCachingDeploymentCheck pre-call check. Here's a working config:
The prompt_caching_deployment_check ensures that once a cache entry is written on Key A, subsequent requests with overlapping cache prefixes route back to Key A — not to Key B. Failover only triggers when Key A returns a 429 or 500.
Source: LiteLLM Prompt Cache Routing docs.
What This Actually Saves
We tested three configurations with a 5-developer team running Claude Code for a full workday (~8 hours, ~400 requests, ~40K avg input tokens per request):
| Setup | Cache Hit Rate | Effective Input Cost | 429 Errors |
|---|---|---|---|
| Single key | 55% | $27/day | 14 |
| Multi-key round-robin | 38% | $35/day | 0 |
| Multi-key cache-aware | 71% | $20/day | 0 |
The single-key cache hit rate looks low at 55% — that's the point. The 14 rate-limit hits during the day caused cascading cache expirations: each 429 stall outlasted the 5-minute TTL, forcing full-price re-reads on the next request. The nominal cache hit rate without 429s was 72%, but the effective rate after accounting for TTL expirations dropped to 55%.
Round-robin eliminated the 429s but increased cost by 28% due to cache fragmentation across keys. Cache-aware routing cut both: zero 429s and a 26% cost reduction from the single-key baseline. The multi-key setup absorbed traffic spikes that would otherwise have triggered cache-evicting delays.
For more on why agent costs spiral when caching breaks, see The Agent Tax: Why Your AI Agent Costs 10x More Than You Expected.
The Operational Headache Nobody Mentions
Configuring LiteLLM correctly is step one. Operating it is the real problem:
- One of your API keys gets throttled — does your retry logic actually work, or does it just fail silently?
- A key expires or hits a billing limit mid-session — your agent stalls with an opaque error.
- You add a fourth key — now you need to update three config files across dev, staging, and prod.
- Claude Code's
ANTHROPIC_API_KEYenv var only accepts one key. You need a proxy layer.
We built Synrouter partly because we got tired of babysitting LiteLLM configs. The core idea: bring your own API keys, and we handle session-sticky routing, cache-aware failover, and per-key health monitoring automatically. You point Claude Code at https://synrouter.ai/api/anthropic instead of https://api.anthropic.com, and the routing layer handles the rest.
No config files. No PromptCachingDeploymentCheck YAML. Just a dashboard showing which keys are healthy, which are rate-limited, and what your effective cache hit rate actually is. Sign up to get started — we're onboarding users weekly.
FAQ
Do I need multiple Anthropic organizations, or just multiple API keys?
Rate limits are per-organization. Multiple API keys within the same org share the same RPM/TPM pool. To actually increase throughput, you need keys from separate Anthropic orgs (separate accounts, separate billing).
Does multi-key routing work with Claude Code's native caching?
Yes — Claude Code sends cache_control headers on its requests. As long as the same key handles requests with the same cache prefix, the cache works normally. The PromptCachingDeploymentCheck in LiteLLM (or Synrouter's built-in session routing) ensures this.
What about OpenAI / GPT-5.5?
OpenAI doesn't have prompt caching in the same way (they offer automatic context caching with different mechanics). Multi-key load balancing still helps with rate limits, but the cache fragmentation problem is less severe.
Read next: LiteLLM Alternative in 2026: Synrouter vs LiteLLM Compared