← Back to blog

The Hidden Cost of Claude's Prompt Caching: Why Your Agent Is Bleeding Tokens

Synrouter Team6 min read
claude prompt cachingagent cost optimizationanthropic api pricingtoken wasteclaude-codemulti-key llm proxyagent session sticky routing

We've now looked at the usage metadata on more than 10 million Claude API calls running through our infra. One pattern shows up again and again: a team flips on prompt caching, expects a 90% discount, and watches their bill go up.

Caching is not a free win. Below a certain hit rate it's a tax. Here's the math nobody puts on the pricing page.


The TTL Trap Most People Walked Into

Anthropic's caching charges a premium to write a cache entry and gives you a steep discount to read one. The numbers that matter, per million input tokens on Sonnet 4.6:

Token typePrice (per 1M)vs standard input
Standard input$3.00
Cache write (5-min TTL)$3.75+25%
Cache write (1-hour TTL)$6.00+100%
Cache read (hit)$0.30−90%

Benchmark: figures measured across 50 concurrent agent sessions, each carrying a ~30k-token shared context prefix, on Sonnet 4.6 through our infra.

The write premium is the part people forget. You pay extra to put something in the cache. You only come out ahead if you read it back enough times to amortize that premium.

And here's the catch that bit a lot of teams in early 2026: the default TTL is 5 minutes. It used to effectively be an hour for Claude Code.

Around March 6, 2026 Anthropic quietly flipped the default for any cache_control block without an explicit ttl back to 300 seconds. No changelog entry, no migration note. The docs still read "up to 1 hour." Here's why it split the room cleanly in two: older Claude Code builds pinned ttl: 3600 explicitly in the request, so they kept getting hour-long caches and never felt a thing. The newer builds dropped the explicit field and inherited the new 300-second default. Same tool, same workload, two wildly different bills — and the only difference was whether your client happened to send a ttl it used to not need to send. Teams on the new build watched their cache hit rate fall off a cliff and their token spend climb, with nothing in their own code to point at.

You can still get the 1-hour tier. You just have to ask for it:

json
1{ "type": "ephemeral", "ttl": 3600 }

The trade is right there in the table above: 1-hour writes cost +100% instead of +25%. Worth it if your agent idles between turns. A waste if your prompts churn.

So picture a long-running agent. It does a human-in-the-loop step, or it idles while a web search comes back, and the 5-minute window lapses. When it wakes up and resends the conversation history, you don't just miss the read discount — you pay the write premium all over again.

We see this exact sequence constantly. A dev fires off "refactor the auth module" in Claude Code. The agent reads six files, writes a plan, and stops to ask for confirmation. The dev tabs over to Slack, gets pulled into a thread, answers a question, grabs coffee. Eight minutes go by. They tab back, hit yes, proceed — and that 40k-token context the agent had warm in cache? Gone. The next turn re-ingests the whole thing at the write premium, not the read discount. The dev didn't do anything wrong. They just looked away for longer than 300 seconds, which on any real workday is most of the time.

The break-even is unforgiving:

If your hit rate sits below ~22%, caching makes your API bill more expensive than not caching at all.

So ask yourself: how many of your agent's turns actually land inside a five-minute window? If you don't know the number, you're not saving money — you're gambling on it.

$0.30 reads sound cheap. They only stay cheap if you actually land the reads.

Round-Robin Quietly Murders Your Cache

The second leak is the one nobody instruments: stateless proxying.

Most teams park a gateway like LiteLLM in front of their models. When Agent A and Agent B both ask for Sonnet at the same time, a stateless balancer round-robins them across different upstream API keys or IP circuits — the standard trick for dodging rate limits. We walked through exactly how this destroys your cache hit rate in LiteLLM vs Native Agent Gateways.

So what happens to the cache you just paid a premium to write?

It evaporates. Anthropic's cache is scoped to the session and the key serving it. Round-robin routing across keys guarantees a cache miss on every hop. You built a rate-limit dodge and accidentally bought a token bonfire.

How to Measure the Hemorrhage

You can't fix what you can't see.

The API hands you cache_creation_input_tokens and cache_read_input_tokens in the usage metadata on every response. The data is right there.

Parsing it across thousands of agent runs, by hand, across frameworks — that's the part nobody wants to do. So we open-sourced the part nobody wants to do.

Agentgauge is an MIT-licensed CLI that reads your execution layer and surfaces the cache hit rate right in your terminal:

bash
1npx agentgauge test -f my_agent_script.ts --track-cache
2
3> Agentgauge Diagnostics
4> Target: claude-sonnet-4-6
5> Runs: 15
6> Cache Hit Rate: 12% ⚠️ (below break-even — caching is costing you money)
7> Estimated Token Waste: $14.50

A 12% hit rate isn't a tuning opportunity. It's a net-negative ROI, and the tool says so out loud.

Fixing It: Context-Aware Routing

Measuring is step one. Fixing it means replacing stateless round-robin with infra that actually knows which agent is talking.

The gateway needs to hash the system prompt plus conversation history and consistently pin requests carrying the same hash to the same upstream connection — the one already holding the warm cache. Pin the context, keep the cache.

Bolting that logic into a Python framework adds latency you'll feel on every turn. That's why we built Synrouter as an Agent-first inference gateway that sits below the application layer: requests with identical context always land on the same warmed cache, no agent-side changes required.

We built it because our own bill kept doing things we didn't expect, and the 5-minute TTL flip was the last straw.

If any of this sounds like your dashboard, do it in order. First, measure — point Agentgauge at your stack and get your real cache hit rate. Numbers, not vibes. If it comes back above the break-even line, great, you're fine, stop reading. If it comes back at 12%, you now know exactly how much the TTL is costing you. Then fix it — move the routing below your app layer so the same context keeps hitting the same warm cache, whether that's Synrouter or something you build yourself. Diagnose first, route second. Don't rewrite infra you haven't measured.

Read next: The 5-Minute TTL: How Anthropic's Prompt Cache Quietly Broke Long-Running Agents See also: Claude Code API Pricing: Token-Level Cost Breakdown — the dollar numbers behind every scenario above.