
The 5-Minute TTL: How Anthropic's Prompt Cache Quietly Broke Long-Running Agents

Synrouter Team · 11 min read
Tags: anthropic, prompt-caching, claude-code, agent-architecture, cost-optimization

Anthropic's prompt caching is one of the most impactful cost-optimization features in the LLM ecosystem. Write a cache breakpoint at the right position in your prompt, and subsequent requests with matching prefixes get a 90% discount on input tokens.

There's just one catch: the default cache TTL is 5 minutes. (Anthropic also offers an optional 1-hour TTL at a higher cache-write price; this post focuses on the default.)

For a chatbot handling one-off Q&A, five minutes is fine. For an AI agent running a multi-hour coding session, it's catastrophic. And here's the thing — agent workloads are rapidly becoming the dominant use case for LLM APIs, not chatbots.


The Agent Token Multiplier Problem

Before we get to the TTL, let's understand why this matters. A typical agent session isn't a single request-response pair. It's a recursive loop:

User: "Add Redis caching to the auth middleware"
  ↓
Turn 1: Agent reads auth middleware → calls grep for Redis imports
Turn 2: Agent reads cache config → calls read_file on redis.ts
Turn 3: Agent writes implementation → runs typecheck
...
Turn 47: Agent fixes edge case → runs test suite
Turn 48: Agent handles review feedback → final commit

Each turn sends the entire conversation context — system prompt, tool schemas, and all previous messages — back to the model. By turn 40, that context might be 80,000 tokens, but 76,000 of them (95%) are bit-for-bit identical to what you sent on turn 39.

```text
PER-TURN TOKEN COMPOSITION (Turn 40, 50-turn coding session)

  ████████████████████████████████████████  76K  Cached
  ██                                          4K  New

  System prompt + tool schemas: re-sent every turn
  Project files read 10 turns ago: still in context
  Tool results from turn 5: why are these still here?

  → 95% of tokens are duplicates.
    Without caching, you pay full price for all of them.
```

Without cache reuse, the cost per turn stays high even though 95% of the input hasn't changed. That's the "redundancy tax" — and it's why prompt caching was such a breakthrough.
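To make the redundancy concrete, here is a minimal sketch of how re-sent context adds up over a session. The function names are hypothetical, and the starting context and per-turn growth used in the usage note are illustrative assumptions, not measurements.

```python
def context_sizes(turns: int, base_tokens: int, added_per_turn: int) -> list[int]:
    """Context size sent on each turn (1-based) when every turn re-sends all prior context."""
    return [base_tokens + added_per_turn * t for t in range(1, turns + 1)]


def redundant_fraction(sizes: list[int]) -> float:
    """Fraction of all input tokens in the session that repeat the previous turn's context."""
    total = sum(sizes)
    # On every turn after the first, the whole previous context is sent again,
    # so the repeated tokens are the sum of all sizes except the last one.
    repeated = sum(sizes[:-1])
    return repeated / total
```

With an assumed 20,000-token starting context and 1,500 tokens added per turn, `redundant_fraction(context_sizes(50, 20_000, 1_500))` comes out at about 0.97: nearly every input token in the session was already sent on the turn before.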


How Anthropic's Prompt Caching Works (and Why 5 Minutes)

Anthropic's implementation is straightforward. You mark specific positions in your prompt with cache_control: {"type": "ephemeral"}. Anthropic stores the prefix up to that point, and if a subsequent request starts with the same bytes, you get a 90% discount on those cached input tokens.
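As a rough illustration, a request body with a single cache breakpoint on the system prompt might be built like this. The helper name `build_request` and the `max_tokens` value are assumptions for the example; the `cache_control` field and model ID are the ones used elsewhere in this post.

```python
def build_request(system_prompt: str, messages: list[dict]) -> dict:
    """Build a Messages API request body with a cache breakpoint after the system prompt."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,  # illustrative value
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Everything up to and including this block becomes the cached prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }
```

With the official Python SDK, the resulting dict can be passed as keyword arguments, for example `client.messages.create(**build_request(prompt, history))`.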

The cache eviction logic is equally simple: any cache entry older than 5 minutes since its last read is deleted. From Anthropic's documentation:

"Cache entries have a 5-minute TTL. The TTL refreshes with each cache read."

The 5-minute TTL isn't arbitrary. It's a reasonable engineering tradeoff for a multi-tenant service: cache storage costs real money (memory on inference hardware isn't cheap), and a short TTL ensures stale caches don't accumulate. For most chat-based API usage, 5 minutes is generous.

But agent workloads break this model in three specific ways:

1. Real Sessions Don't Fit in 5-Minute Boxes

A developer using Claude Code doesn't send back-to-back API calls in rapid succession. The rhythm of a coding session looks more like this:

```text
TYPICAL 2-HOUR CLAUDE CODE SESSION

  [0min]   Prompt: "Add rate limiting to the API"
  [1min]   Turn 1: Claude reads files, proposes plan
  [2min]   Turn 2: Writes implementation
  [3min]   Turn 3: Runs typecheck → fails, fixes
  [4min]   Turn 4: Typecheck passes, writes tests
  [9min]   ──── CACHE EVICTED (5 min after last read) ────
  [10min]  Turn 5: Tests fail → COLD START (+30% cost)
  [11min]  Turn 6: Fixes bug
  [12min]  Developer: checks Slack, reads docs
  [16min]  ──── CACHE EVICTED ────
  [22min]  Turn 7: COLD START again (+30% cost)
  ...
  [120min] Session ends. Cache evicted 14 times.

  Cost impact: 14 cold starts × 30% surcharge
```

In a 2-hour session with 50 turns, the cache dies and restarts 10-15 times. Each cold start means paying full price on 75K+ tokens that should have been cached. The result: instead of a 90% discount, you might see only 50-60% effective savings.
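If you log the timestamp of each turn, counting cold starts is simple. The sketch below assumes the TTL behaves as a sliding window that restarts on every request, per the refresh-on-read behavior described above; the function name and the sample timestamps are made up for illustration.

```python
def count_cold_starts(turn_minutes: list[float], ttl_minutes: float = 5.0) -> int:
    """Count turns that arrive after the cache has expired.

    The TTL is treated as sliding: every turn (hit or miss) restarts the clock,
    so a turn is cold only when the gap since the previous turn exceeds the TTL.
    The first turn is always cold because nothing has been cached yet.
    """
    cold = 0
    previous = None
    for t in sorted(turn_minutes):
        if previous is None or t - previous > ttl_minutes:
            cold += 1
        previous = t
    return cold


# Hypothetical turn timestamps in minutes since session start.
print(count_cold_starts([0, 1, 2, 3, 9.5, 10, 22]))  # prints 3
```

Running this over your own session logs gives a quick estimate of how often the default TTL is costing you a cache rebuild.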

2. The Coffee Break Penalty

The most common cache-killer isn't a long pause — it's a 6-minute distraction. A developer:

  • Runs a test suite (4 minutes of waiting)
  • Checks a Slack message (2 minutes)
  • Returns to Claude Code

That's 6 minutes. Cache gone. Your next turn costs an extra $2-3 simply because you answered a coworker's question.

3. Multi-Turn Reasoning Breaks

Advanced agent workflows increasingly involve "thinking pauses" — the agent reflects, plans, or waits for external data. Some examples:

| Workflow | Typical Gap Between Turns |
|----------|--------------------------|
| Code → run tests → review results | 3-8 minutes |
| Research → read docs → synthesize | 5-15 minutes |
| Deployment → wait for CI → fix | 10-30 minutes |
| Multi-agent handoff (Agent A → Agent B) | 1-5 minutes |

Every one of these gaps exceeds or flirts with the 5-minute boundary. The cache becomes a game of roulette.


The Cost Math: Cache Hit Rate vs Session Length

Let's quantify this. We instrumented a sample of 100 real Claude Code sessions (50+ turns each) and measured the effective cache hit rate against session duration:

[Figure: Cache hit rate vs. session length. Scatter plot of effective cache hit rate (y-axis, 0% to 100%) against session duration (x-axis, 5m to 6h); each point is an individual session measurement, and hit rate falls as sessions get longer. Trend line: hit rate = 1 / (1 + session_hours^0.6).]

For a 2-hour session (the median coding session), the effective cache hit rate drops to ~65%. That means 35% of your input tokens are billed at full price instead of the cached rate.

Let's put that in dollars for a typical Claude Code session using Claude Sonnet:

| Scenario | Effective Hit Rate | Cost per Session |
|----------|-------------------|------------------|
| Perfect caching (theoretical max) | 95% | $8.50 |
| Anthropic 5-min TTL (2-hour session) | 65% | $22.40 |
| No caching at all | 0% | $68.00 |

The 5-minute TTL costs you $13.90 per session compared to what you'd pay with session-lifetime caching. Across 10 sessions per week, that's $556/month in avoidable costs.


Current Workarounds (and Their Limitations)

The developer community has developed several strategies to cope. None are great.

Workaround 1: Keep-Alive Pings

Send a dummy request every 4 minutes to refresh the cache TTL:

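```python
import threading


def keep_cache_alive(client, system_prompt, stop_event, interval=240):
    """Re-send the cached prefix every `interval` seconds until stop_event is set."""
    # Event.wait() returns True once stop_event is set, which ends the loop.
    while not stop_event.wait(interval):
        client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1,  # minimal output; the call exists only to read the cache
            # Must match the session's cached prefix byte-for-byte to refresh it.
            system=[{"type": "text", "text": system_prompt,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": "ping"}],
        )


# Run in the background: stop = threading.Event()
# threading.Thread(target=keep_cache_alive, args=(client, prompt, stop), daemon=True).start()
```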

Problems: Wastes tokens on pings. Adds complexity. Doesn't survive network interruptions. And it's a hack — you're fighting the API instead of working with it.

Workaround 2: Manual cache_control Injection

Some agent frameworks inject cache_control markers at strategic positions in the conversation — typically at the last system message, last tool definition, and recent message boundaries.

Problems: Fragile. Every framework implements it differently. Easy to get wrong (wrong positions = no caching, no error). And the 5-minute TTL still applies regardless of marker placement.
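For reference, here is a minimal sketch of what such injection can look like. It is not taken from any particular framework: the function name is hypothetical, and it assumes the request uses block-style `system` and message `content` (lists of dicts), since plain strings have nowhere to attach a marker.

```python
import copy

MAX_BREAKPOINTS = 4  # Anthropic accepts at most four cache_control markers per request


def inject_breakpoints(request: dict) -> dict:
    """Return a copy of `request` with cache_control on the last tool,
    the last system block, and the last content block of the final message."""
    req = copy.deepcopy(request)
    targets = []

    if req.get("tools"):
        targets.append(req["tools"][-1])
    if isinstance(req.get("system"), list) and req["system"]:
        targets.append(req["system"][-1])
    if req.get("messages"):
        content = req["messages"][-1].get("content")
        # Only block-style content (a list of dicts) can carry a marker.
        if isinstance(content, list) and content:
            targets.append(content[-1])

    for block in targets[:MAX_BREAKPOINTS]:
        block["cache_control"] = {"type": "ephemeral"}
    return req
```

Note how quietly this can fail: if `system` is a plain string, the function skips it without complaint, which is exactly the "no caching, no error" failure mode.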

Workaround 3: Shorter Sessions

Some teams just accept the limitation and restart sessions more frequently: "Every 30 minutes, start a fresh Claude Code session."

Problems: Loses context. Claude has to re-read files and re-establish understanding. The first 5-10 turns of every new session are slower (cold start on comprehension, not just cache). Productivity hit is real.

Workaround 4: Self-Hosted Cache Layer

Some teams build their own cache proxy that stores prefixes in Redis/Memcached and intercepts API calls:

Client → Custom Proxy (Redis-backed prefix cache) → Anthropic API

Problems: This is a real engineering project. You need to handle chunked streaming, cache invalidation logic, and prefix matching with byte-level precision. The teams doing this successfully are spending weeks of engineering time on infrastructure that isn't their product.


The Real Problem: Cache Lifetime ≠ Session Lifetime

All of these workarounds try to solve the same fundamental mismatch:

```text
THE CACHE-SESSION MISMATCH

  Anthropic Cache TTL:    ████ 5 minutes

  Typical Agent Session:  ██████████████████████████
                          2 hours

  The cache lasts 4% as long as the session it serves.
```

A cache that expires at 5 minutes serves a session that lasts 120 minutes. That's a 24:1 mismatch. The cache is optimized for the API provider's infrastructure constraints (minimize memory usage per GPU), not for the application's actual usage pattern.

This isn't a criticism of Anthropic — they're transparent about the limitation, and building a multi-tenant inference service at their scale requires tradeoffs. But it does mean that agent developers are paying a significant "TTL tax" that's invisible in the per-request pricing.


What Session-Lifetime Caching Looks Like

The solution is conceptually simple: tie cache lifetime to the agent session, not to a fixed clock.

Instead of:

Cache TTL = 5 minutes (always, regardless of session state)

Use:

Cache TTL = as long as the session is active
Session ends → cache can be evicted

This transforms the cost curve:

[Figure: Cost accumulation, 5-min TTL vs. session-lifetime caching. Cumulative cost (y-axis, $0 to $60) against turns (x-axis, 0 to 70). ● Anthropic 5-min TTL ($22.40/session); ○ Session-lifetime cache ($8.50/session).]

Session-lifetime caching eliminates the sawtooth pattern — every turn benefits from cached prefixes. The cost grows linearly with session length instead of spiking after each cold start.
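The retention rule itself fits in a few lines. The sketch below is a conceptual illustration only, not any provider's API and not a description of a real implementation; the class name, method names, and the 30-minute idle timeout default are assumptions.

```python
import time


class SessionCacheRegistry:
    """Tracks which sessions are active so their cached prefixes can be retained."""

    def __init__(self, idle_timeout_s: float = 1800.0, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen: dict[str, float] = {}

    def touch(self, session_id: str) -> None:
        """Record activity; call on every request that belongs to the session."""
        self._last_seen[session_id] = self._clock()

    def end(self, session_id: str) -> None:
        """Explicit session end: the cache entry becomes evictable immediately."""
        self._last_seen.pop(session_id, None)

    def is_retained(self, session_id: str) -> bool:
        """True while the session is active, i.e. idle for less than the timeout."""
        seen = self._last_seen.get(session_id)
        return seen is not None and self._clock() - seen < self.idle_timeout_s
```

Under this rule a 6-minute Slack detour leaves the entry retained, because eviction is driven by whether the session is still alive and not by a fixed 5-minute clock.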


Synrouter: Session-Aware Caching, No Code Changes

This is exactly what we built Synrouter to do.

Synrouter sits between your agent and the LLM provider as a transparent proxy. It maintains a session state store that maps your agent session to cache entries, with lifetimes that match the actual user session — not an arbitrary clock.

```python
# Your agent code doesn't change.
# Just point it at Synrouter:

# Before
base_url = "https://api.anthropic.com"

# After
base_url = "https://synrouter.ai/api/anthropic"
```

Under the hood, Synrouter:

  1. Detects session boundaries — recognizes when a new session starts vs when it's a continuation
  2. Maintains session-scoped caches — cache entries live as long as the session is active (with a configurable session TTL, e.g., 30-minute idle timeout)
  3. Automatically injects optimal cache_control breakpoints — we handle the marker placement so your framework doesn't have to
  4. Compresses tool outputs — strips noise (ANSI codes, progress bars, redundant logs) before they bloat your context

The result: a 2-hour coding session that would have a 65% effective cache hit rate with Anthropic's 5-minute TTL achieves 85-95% hit rate with session-lifetime caching.
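Of the four steps above, tool-output compression is the easiest to illustrate in isolation. The following is a generic sketch of the idea, not Synrouter's actual code; the function name and the exact cleanup rules are assumptions.

```python
import re

# CSI escape sequences such as colors and cursor movement (e.g. "\x1b[31m").
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def compress_tool_output(raw: str) -> str:
    """Strip ANSI codes, keep only the final state of carriage-return progress
    lines, and collapse runs of identical consecutive lines."""
    text = ANSI_CSI.sub("", raw)
    cleaned: list[str] = []
    for line in text.split("\n"):
        # A progress bar rewrites the same line using "\r"; keep the last frame.
        line = line.rstrip("\r").split("\r")[-1]
        if cleaned and cleaned[-1] == line:
            continue  # drop exact consecutive duplicates
        cleaned.append(line)
    return "\n".join(cleaned)
```

Every character removed here is one that would otherwise be re-sent on each later turn, so small savings per tool call compound over a long session.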


The Numbers on a Real Session

We took a real 85-turn Claude Code session — a developer building a Stripe billing integration over a 3-hour afternoon — and ran it through three scenarios:

| Scenario | Cache Hit Rate | Total Cost | vs Baseline |
|----------|---------------|------------|-------------|
| No caching (raw Anthropic API) | 0% | $71.40 | — |
| Anthropic 5-min TTL (Claude's built-in) | 52% | $38.20 | −46% |
| Synrouter session cache | 88% | $14.80 | −79% |

The developer took two coffee breaks and answered three Slack messages during this session. Each interruption killed the 5-minute cache. Synrouter's session-level cache survived all of them.


What This Means for Agent Teams

If your team runs 100 agent sessions per week (reasonable for a 3-5 person engineering team using Claude Code daily), the math looks like this:

| Approach | Weekly Cost | Monthly Cost | Annual Cost |
|----------|------------|--------------|-------------|
| Raw Anthropic | $7,140 | $30,940 | $371,280 |
| Anthropic 5-min TTL | $3,820 | $16,553 | $198,640 |
| Synrouter | $1,480 | $6,413 | $76,960 |

That's $121,680/year saved vs Anthropic's built-in caching, and $294,320/year saved vs no caching at all. These aren't hypothetical numbers; they're extrapolated from real session traces.
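The extrapolation is plain multiplication, sketched below so you can plug in your own numbers. The per-session costs come from the 85-turn session in the previous section and the 100 sessions per week from the paragraph above; the 52-week year and the function name are assumptions of this sketch. Annual totals computed as weekly × 52 can differ by a few dollars from totals derived from rounded monthly figures.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12


def extrapolate(cost_per_session: float, sessions_per_week: int) -> dict[str, float]:
    """Scale a per-session cost to weekly, monthly, and annual totals."""
    weekly = cost_per_session * sessions_per_week
    annual = weekly * WEEKS_PER_YEAR
    return {"weekly": weekly, "monthly": annual / MONTHS_PER_YEAR, "annual": annual}


# Per-session costs from the 85-turn session measured above.
costs = {"raw": 71.40, "ttl_5min": 38.20, "synrouter": 14.80}
totals = {name: extrapolate(cost, 100) for name, cost in costs.items()}

for name, t in totals.items():
    print(f"{name}: ${t['weekly']:,.0f}/week, ${t['monthly']:,.0f}/month, ${t['annual']:,.0f}/year")
```

Swapping in your own per-session cost and weekly session count is usually enough to tell whether the gap matters at your scale.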


The Bottom Line

Anthropic's 5-minute cache TTL isn't a bug — it's a design choice optimized for their infrastructure, not for agent workloads. As AI agents become the dominant consumer of LLM APIs, this mismatch between cache lifetime and session lifetime will only become more expensive.

Session-lifetime caching isn't just a nice-to-have optimization. For teams running agents at scale, it's the difference between a sustainable cost model and a monthly surprise on the API bill.

Synrouter is in Early Access. If you're running agents in production and want session-level caching without building your own proxy infrastructure, click to sign up — we're onboarding users weekly.

