Anthropic 5-Min Cache TTL: Complete Guide + Cost Calculator

Last updated: June 22, 2026 — added provider comparison table and FAQ section.

Anthropic's prompt caching is one of the most effective cost levers available to LLM API users. Place a cache breakpoint at the right position in your prompt, and subsequent requests that start with the same prefix get a 90% discount on the cached input tokens.

There's just one catch: by default, the cache TTL is 5 minutes.

For a chatbot handling one-off Q&A, five minutes is fine. For an AI agent running a multi-hour coding session, it's catastrophic. And here's the thing — agent workloads are rapidly becoming the dominant use case for LLM APIs, not chatbots.

The Agent Token Multiplier Problem

Before we get to the TTL, let's understand why this matters. A typical agent session isn't a single request-response pair. It's a recursive loop:

text

User: "Add Redis caching to the auth middleware"
  ↓
Turn 1: Agent reads auth middleware → calls grep for Redis imports
Turn 2: Agent reads cache config → calls read_file on redis.ts
Turn 3: Agent writes implementation → runs typecheck
...
Turn 47: Agent fixes edge case → runs test suite
Turn 48: Agent handles review feedback → final commit

Each turn sends the entire conversation context — system prompt, tool schemas, and all previous messages — back to the model. By turn 40, that context might be 80,000 tokens, but 76,000 of them (95%) are bit-for-bit identical to what you sent on turn 39.

text

PER-TURN TOKEN COMPOSITION (Turn 40, 50-turn coding session)

████████████████████████████████████████ 76K Cached
██ 4K New

System prompt + tool schemas: re-sent every turn
Project files read 10 turns ago: still in context
Tool results from turn 5: why are these still here?

→ 95% of tokens are duplicates.
  Without caching, you pay full price for all of them.

Without cache reuse, the cost per turn stays high even though 95% of the input hasn't changed. That's the "redundancy tax", and it's why prompt caching was such a breakthrough. This redundancy is the core of what we call the Agent Tax, which we cover in a separate breakdown.
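To make the redundancy concrete, here is a rough sketch that totals input tokens over a whole session. The growth model (a fixed starting context plus a constant number of new tokens per turn) and the example numbers are illustrative assumptions, not measurements.

```python
def session_input_tokens(turns: int, base_tokens: int, new_per_turn: int) -> tuple[int, int]:
    """Return (total_input_tokens, resent_tokens) for a session.

    Assumes the context on turn t is base_tokens + t * new_per_turn,
    and that every turn re-sends the full context of the previous turn.
    """
    total = 0
    resent = 0
    previous_context = 0  # tokens sent on the previous turn
    for turn in range(1, turns + 1):
        context = base_tokens + turn * new_per_turn
        total += context
        resent += previous_context  # everything from the last turn is sent again
        previous_context = context
    return total, resent


# Hypothetical 50-turn session: 20K starting context, 1.5K new tokens per turn.
total, resent = session_input_tokens(turns=50, base_tokens=20_000, new_per_turn=1_500)
print(f"{resent / total:.0%} of all input tokens were re-sent from an earlier turn")
```

The exact share depends on how fast the context grows, but under any growth model of this shape the re-sent tokens dominate the bill.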

How Anthropic's Prompt Caching Works (and Why 5 Minutes)

Anthropic's implementation is straightforward. You mark specific positions in your prompt with cache_control: {"type": "ephemeral"}. Anthropic stores the prefix up to that point, and if a subsequent request starts with the same bytes, you get a 90% discount on those cached input tokens.
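As a minimal sketch of what that marker looks like in practice, the helper below builds the keyword arguments for a Messages API call with a single breakpoint on the system prompt. The helper name and the example strings are made up for illustration; the request itself is only constructed here, not sent.

```python
def build_cached_request(system_prompt: str, user_message: str, model: str) -> dict:
    """Build kwargs for client.messages.create with one cache breakpoint."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # The prefix up to and including this block is what gets cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }


request = build_cached_request(
    system_prompt="You are a coding agent...",
    user_message="Add rate limiting to the API",
    model="claude-sonnet-4-20250514",
)
# With the anthropic SDK installed, this would be sent as:
#   response = anthropic.Anthropic().messages.create(**request)
# response.usage.cache_read_input_tokens then reports tokens served from cache.
```

Checking `cache_read_input_tokens` and `cache_creation_input_tokens` in the response usage is the simplest way to confirm whether a given turn was a hit or a rewrite.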

The cache eviction logic is equally simple: any cache entry older than 5 minutes since its last read is deleted. From Anthropic's documentation:

"Cache entries have a 5-minute TTL. The TTL refreshes with each cache read."

The 5-minute TTL isn't arbitrary — it's a reasonable engineering tradeoff for a multi-tenant service. Cache storage costs real money (memory on inference hardware isn't cheap), and a short TTL ensures stale caches don't accumulate. For 95% of chat-based API usage, 5 minutes is generous.
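The refresh-on-read rule is easy to simulate. The sketch below counts cold starts for a list of request timestamps, assuming each request either reads or rewrites the cached prefix so that expiry is always measured from the previous request. The timestamps are hypothetical.

```python
TTL_SECONDS = 300  # 5-minute TTL, refreshed on every cache read


def count_cold_starts(request_times: list[float], ttl: float = TTL_SECONDS) -> int:
    """Count requests that arrive after the cache entry has expired.

    request_times are seconds since session start, in ascending order.
    The first request always writes the cache, so it is not counted.
    """
    cold_starts = 0
    for previous, current in zip(request_times, request_times[1:]):
        # The previous request refreshed (or rewrote) the entry at `previous`.
        if current - previous > ttl:
            cold_starts += 1
    return cold_starts


# Hypothetical session: four quick turns, a 6-minute gap, one more turn, an 11-minute gap.
times = [60, 120, 180, 240, 600, 660, 1320]
print(count_cold_starts(times))  # 2
```

Only the gaps matter: a session can run for hours without a single cold start as long as no pause exceeds the TTL.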

But agent workloads break this model in three specific ways:

1. Real Sessions Don't Fit in 5-Minute Boxes

A developer using Claude Code doesn't send back-to-back API calls in rapid succession. The rhythm of a coding session looks more like this:

text

TYPICAL 2-HOUR CLAUDE CODE SESSION

[0min]   Prompt: "Add rate limiting to the API"
[1min]   Turn 1: Claude reads files, proposes plan
[2min]   Turn 2: Writes implementation
[3min]   Turn 3: Runs typecheck → fails, fixes
[4min]   Turn 4: Typecheck passes, writes tests
[9min]   ──── CACHE EVICTED (5 min after last read) ────
[10min]  Turn 5: Tests fail → COLD START (+30% cost)
[11min]  Turn 6: Fixes bug
[12min]  Developer: checks Slack, reads docs
[16min]  ──── CACHE EVICTED ────
[22min]  Turn 7: COLD START again (+30% cost)
...
[120min] Session ends. Cache evicted 14 times.

Cost impact: 14 cold starts × 30% surcharge

In a 2-hour session with 50 turns, the cache dies and restarts 10-15 times. Each cold start means paying full price on 75K+ tokens that should have been cached. The result: instead of a 90% discount, you might see only 50-60% effective savings.
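A back-of-the-envelope model shows how cold starts erode the discount. This sketch assumes cached reads cost 0.10x the base input price (the 90% discount above), that a cold start rewrites the prefix at 1.25x (an assumed cache-write premium), and that new tokens are billed at the base rate. The function name and session shape are illustrative.

```python
def effective_savings(turns: int, cold_starts: int, prefix_tokens: int, new_tokens: int,
                      read_mult: float = 0.10, write_mult: float = 1.25) -> float:
    """Fraction of input cost saved versus sending every token at full price.

    cold_starts includes the first turn, which always writes the cache.
    """
    warm_turns = turns - cold_starts
    cached_cost = (
        warm_turns * prefix_tokens * read_mult      # prefix served from cache
        + cold_starts * prefix_tokens * write_mult  # prefix rewritten after eviction
        + turns * new_tokens                        # new tokens at the base rate
    )
    uncached_cost = turns * (prefix_tokens + new_tokens)
    return 1 - cached_cost / uncached_cost


# 50 turns, 12 cold starts, 76K-token prefix, 4K new tokens per turn.
print(f"{effective_savings(50, 12, 76_000, 4_000):.0%}")  # 59%
```

Under these assumptions, 12 cold starts in 50 turns land at roughly 59% savings, inside the 50-60% range quoted above.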

2. The Coffee Break Penalty

The most common cache-killer isn't a long pause — it's a 6-minute distraction. A developer:

Runs a test suite (4 minutes of waiting)
Checks a Slack message (2 minutes)
Returns to Claude Code

That's 6 minutes. Cache gone. Your next turn costs an extra $2-3 simply because you answered a coworker's question.

3. Multi-Turn Reasoning Breaks

Advanced agent workflows increasingly involve "thinking pauses" — the agent reflects, plans, or waits for external data. Some examples:

Workflow	Typical Gap Between Turns
Code → run tests → review results	3-8 minutes
Research → read docs → synthesize	5-15 minutes
Deployment → wait for CI → fix	10-30 minutes
Multi-agent handoff (Agent A → Agent B)	1-5 minutes

Every one of these gaps exceeds or flirts with the 5-minute boundary. The cache becomes a game of roulette.

The Cost Math: Cache Hit Rate vs Session Length

Let's quantify this. We instrumented a sample of 100 real Claude Code sessions (50+ turns each) and measured the effective cache hit rate against session duration:

[Figure: Cache hit rate vs session length. Scatter plot, one point per session. X-axis: session duration (5m, 15m, 30m, 1h, 2h, 3h, 4h, 6h). Y-axis: hit rate (0% to 100%). Hit rate falls as sessions get longer. Trend line: hit rate = 1 / (1 + session_hours^0.6).]

For a 2-hour session (the median coding session), the effective cache hit rate drops to ~65%. That means 35% of your input tokens are billed at full price instead of the cached rate.

Let's put that in dollars for a typical Claude Code session using Claude Sonnet:

Scenario	Effective Hit Rate	Cost per Session
Session-lifetime caching (theoretical max)	95%	$8.50
Anthropic 5-min TTL (2-hour session)	65%	$22.40
No caching at all	0%	$68.00

The 5-minute TTL costs you $13.90 per session compared to what you'd pay with session-lifetime caching. Across 10 sessions per week, that's $556/month in avoidable costs.

Cache TTL Across Providers: How Anthropic Compares

Anthropic's 5-minute cache TTL isn't an industry standard — it's the most restrictive among major LLM providers. Here's how the landscape looks:

Provider	Cache TTL	Refresh Mechanism	Agent-Friendly?
Anthropic	5 minutes	Resets on each cache read; expires after 5 idle minutes	❌ Breaks at >5min pauses
OpenAI	5–10 min typical, up to 1h off-peak	Automatic, server-side	⚠️ Unpredictable; degrades under load
Google (Gemini)	1 hour default (configurable)	Explicit `cached_content` API	⚠️ Requires manual cache management
Self-hosted (vLLM)	Configurable	Prefix-aware, manual	✅ Full control, but needs infra
Synrouter	Session lifetime	Transparent; no agent changes	✅ Designed for agent workloads

A few things stand out:

OpenAI's cache TTL is longer but unpredictable. Their default in-memory caching keeps tokens for 5–10 minutes of inactivity, extending up to an hour during off-peak periods. The problem: you can't depend on it. An agent that works at 2 PM might cost 3x more at 10 AM because the cache TTL dropped from 60 minutes to 5 under load. Unpredictable cache TTL is almost as bad as a short one — it makes cost forecasting impossible. GPT-4.1 and GPT-5.x models do support an Extended retention policy with up to 24-hour TTL — but this requires explicit cache breakpoints and disables automatic eviction, so it's not a drop-in replacement for default caching in typical agent workflows.

Google's Gemini cache TTL is configurable but not automatic. Unlike Anthropic and OpenAI, Google exposes an explicit cached_content API where you create named cache entries with a configurable TTL (default 1 hour, extendable to days). The tradeoff: you have to manage cache lifecycle yourself — create entries, track TTLs, handle expiration. For agent workloads that already juggle state management, this is one more thing to build and maintain.

Self-hosted (vLLM/TGI) is the only way to fully control cache TTL. You set the prefix caching window to whatever your hardware can hold. The downside: you're now running GPU infrastructure, which is a different cost class entirely.

The pattern is clear: no major API provider treats cache TTL as a first-class feature for agent workloads. Every provider optimizes for chatbot patterns — bursty, short-lived, one-shot — because that's what their infrastructure was built for. Agent workloads (long sessions, high prefix reuse, unpredictable pauses) expose the gaps in every implementation.

This is why "cache TTL" has become a recurring pain point in developer communities. The fix isn't convincing Anthropic to change their TTL — it's building a layer that manages cache lifecycle independently of any single provider's TTL policy.

Current Workarounds (and Their Limitations)

The developer community has developed several strategies to cope. None are great.

Workaround 1: Keep-Alive Pings

Send a dummy request every 4 minutes to refresh the cache TTL:

python

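import threading


def keep_cache_alive(client, system_prompt, stop_event, interval=240):
    """Re-send the cached prefix every `interval` seconds until stop_event is set."""
    # Event.wait returns False on timeout and True once the event is set,
    # so the loop pings on each timeout and exits cleanly on shutdown.
    while not stop_event.wait(interval):
        client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1,
            system=[{"type": "text", "text": system_prompt,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": "ping"}],
        )

# Usage (client is an anthropic.Anthropic instance):
#   stop = threading.Event()
#   threading.Thread(target=keep_cache_alive,
#                    args=(client, system_prompt, stop), daemon=True).start()
#   ...
#   stop.set()  # when the session ends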

Problems: Wastes tokens on pings. Adds complexity. Doesn't survive network interruptions. And it's a hack — you're fighting the API instead of working with it.
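To put a number on the wasted tokens, the sketch below estimates what pings cost during an idle period. It assumes each ping re-reads the whole cached prefix at 0.10x the base input price, and the $3 per million tokens in the example is an assumed price for illustration.

```python
def keep_alive_cost(idle_minutes: float, prefix_tokens: int, price_per_mtok: float,
                    interval_s: int = 240, read_mult: float = 0.10) -> float:
    """Estimated dollars spent on keep-alive pings during an idle period."""
    pings = int(idle_minutes * 60 // interval_s)
    return pings * prefix_tokens * read_mult * price_per_mtok / 1_000_000


# One idle hour with a 76K-token prefix at an assumed $3 per million input tokens.
print(f"${keep_alive_cost(60, 76_000, 3.0):.3f}")  # $0.342
```

Fifteen pings over an idle hour re-read 1.14M cached tokens in total, so long pauses can cost more to bridge than a single cold start would.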

Workaround 2: Manual cache_control Injection

Some agent frameworks inject cache_control markers at strategic positions in the conversation — typically at the last system message, last tool definition, and recent message boundaries.

Problems: Fragile. Every framework implements it differently. Easy to get wrong (wrong positions = no caching, no error). And the 5-minute TTL still applies regardless of marker placement.
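A minimal sketch of such an injector, assuming the request is a plain dict in Messages API shape. It marks the three positions named above: the last tool definition, the last system block, and the last block of the final message. The function name is hypothetical, and real frameworks differ in which positions they choose.

```python
import copy


def inject_cache_breakpoints(request: dict) -> dict:
    """Return a copy of the request with cache_control on three positions."""
    marked = copy.deepcopy(request)  # never mutate the caller's request
    marker = {"type": "ephemeral"}

    if marked.get("tools"):
        marked["tools"][-1]["cache_control"] = dict(marker)

    system = marked.get("system")
    if isinstance(system, str):
        # A plain string cannot carry cache_control, so promote it to a text block.
        system = [{"type": "text", "text": system}]
    if system:
        system[-1]["cache_control"] = dict(marker)
        marked["system"] = system

    if marked.get("messages"):
        last = marked["messages"][-1]
        if isinstance(last["content"], str):
            last["content"] = [{"type": "text", "text": last["content"]}]
        if last["content"]:
            last["content"][-1]["cache_control"] = dict(marker)

    return marked
```

Three markers stay within the limit of four breakpoints per request. The silent-failure risk remains: if anything earlier in the prefix changes between turns, the markers are still valid but nothing is read from cache.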

Workaround 3: Shorter Sessions

Some teams just accept the limitation and restart sessions more frequently: "Every 30 minutes, start a fresh Claude Code session."

Problems: Loses context. Claude has to re-read files and re-establish understanding. The first 5-10 turns of every new session are slower (cold start on comprehension, not just cache). Productivity hit is real.

Workaround 4: Self-Hosted Cache Layer

Some teams build their own cache proxy that stores prefixes in Redis/Memcached and intercepts API calls:

text

Client → Custom Proxy (Redis-backed prefix cache) → Anthropic API

Problems: This is a real engineering project. You need to handle chunked streaming, cache invalidation logic, and prefix matching with byte-level precision. The teams doing this successfully are spending weeks of engineering time on infrastructure that isn't their product.
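To give a sense of the prefix-matching piece alone, here is a rough sketch that derives one rolling hash per prefix of a request and finds the longest prefix already seen. It uses an in-memory set where a real proxy would use Redis, and it hashes canonical JSON rather than raw wire bytes, which is a simplifying assumption. All names are hypothetical.

```python
import hashlib
import json


def prefix_keys(blocks: list[dict]) -> list[str]:
    """Return one rolling SHA-256 key per prefix of the block list."""
    digest = hashlib.sha256()
    keys = []
    for block in blocks:
        # Canonical JSON so logically equal blocks hash to the same bytes.
        digest.update(json.dumps(block, sort_keys=True, separators=(",", ":")).encode("utf-8"))
        digest.update(b"\x00")  # separator so block boundaries affect the hash
        keys.append(digest.hexdigest())
    return keys


def longest_cached_prefix(blocks: list[dict], cache: set[str]) -> int:
    """Number of leading blocks whose prefix key is already in the cache."""
    matched = 0
    for count, key in enumerate(prefix_keys(blocks), start=1):
        if key in cache:
            matched = count
    return matched


turn_1 = [{"role": "system", "text": "You are a coding agent"}, {"role": "user", "text": "Add caching"}]
seen = set(prefix_keys(turn_1))
turn_2 = turn_1 + [{"role": "assistant", "text": "Reading files"}]
print(longest_cached_prefix(turn_2, seen))  # 2
```

Because the hash is rolling, changing any earlier block changes every later key, which is the same all-or-nothing behavior that makes provider-side prefix caches sensitive to small edits.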

The Real Problem: Cache Lifetime ≠ Session Lifetime

All of these workarounds try to solve the same fundamental mismatch:

text

THE CACHE-SESSION MISMATCH

Anthropic Cache TTL:    ████ 5 minutes
Typical Agent Session:  ██████████████████████████ 2 hours

The cache lasts 4% as long as the session it serves.

A cache that expires at 5 minutes serves a session that lasts 120 minutes. That's a 24:1 mismatch. The cache is optimized for the API provider's infrastructure constraints (minimize memory usage per GPU), not for the application's actual usage pattern.

This isn't a criticism of Anthropic — they're transparent about the limitation, and building a multi-tenant inference service at their scale requires tradeoffs. But it does mean that agent developers are paying a significant "TTL tax" that's invisible in the per-request pricing.

What Session-Lifetime Caching Looks Like

The solution is conceptually simple: tie cache lifetime to the agent session, not to a fixed clock.

Instead of:

text

Cache TTL = 5 minutes (always, regardless of session state)

Use:

text

Cache TTL = as long as the session is active
Session ends → cache can be evicted

This transforms the cost curve:

[Figure: Cost accumulation, 5-min TTL vs session-lifetime caching. X-axis: turns (0 to 70). Y-axis: cumulative cost ($0 to $60). Series: ● Anthropic 5-min TTL ($22.40/session); ○ Session-lifetime cache ($8.50/session).]

Session-lifetime caching eliminates the sawtooth pattern — every turn benefits from cached prefixes. The cost grows linearly with session length instead of spiking after each cold start.
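The lifetime rule itself is small. The sketch below is a hypothetical in-memory store, not any product's implementation: entries stay alive while their session keeps making requests and are dropped only after the session has been idle for a configurable timeout. The 30-minute default and the injectable clock are illustrative choices.

```python
import time


class SessionCache:
    """Keeps entries alive while their session is active; evicts after an idle timeout."""

    def __init__(self, idle_timeout_s: float = 1800, clock=time.monotonic):
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        # session_id -> {"last_seen": float, "entries": dict}
        self._sessions: dict[str, dict] = {}

    def _touch(self, session_id: str) -> dict:
        session = self._sessions.setdefault(session_id, {"last_seen": 0.0, "entries": {}})
        session["last_seen"] = self._clock()  # any activity keeps the session alive
        return session["entries"]

    def put(self, session_id: str, key: str, value) -> None:
        self._touch(session_id)[key] = value

    def get(self, session_id: str, key: str):
        self.evict_idle()
        if session_id not in self._sessions:
            return None
        return self._touch(session_id).get(key)

    def evict_idle(self) -> None:
        now = self._clock()
        expired = [sid for sid, s in self._sessions.items()
                   if now - s["last_seen"] > self._idle_timeout_s]
        for sid in expired:
            del self._sessions[sid]
```

The difference from a fixed TTL is what resets the clock: here it is any activity in the session, and the timeout only decides when an abandoned session can be cleaned up.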

Synrouter: Session-Aware Caching, No Code Changes

This is exactly what we built Synrouter to do.

Synrouter sits between your agent and the LLM provider as a transparent proxy. It maintains a session state store that maps your agent session to cache entries, with lifetimes that match the actual user session — not an arbitrary clock.

bash

# Your agent code doesn't change.
# Just point it at Synrouter:

# Before
base_url = "https://api.anthropic.com"

# After
base_url = "https://synrouter.ai/api/anthropic"

Under the hood, Synrouter:

Detects session boundaries — recognizes when a new session starts vs when it's a continuation
Maintains session-scoped caches — cache entries live as long as the session is active (with a configurable session TTL, e.g., 30-minute idle timeout)
Automatically injects optimal cache_control breakpoints — we handle the marker placement so your framework doesn't have to
Compresses tool outputs — strips noise (ANSI codes, progress bars, redundant logs) before they bloat your context

The result: a 2-hour coding session that would have a 65% effective cache hit rate with Anthropic's 5-minute TTL achieves 85-95% hit rate with session-lifetime caching.
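Of the steps listed above, tool-output compression is the easiest to illustrate in isolation. The following is a simplified sketch of the idea, not Synrouter's actual implementation: it strips ANSI escape sequences, keeps only the final state of lines redrawn with carriage returns (progress bars), and collapses consecutive duplicate lines.

```python
import re

# CSI escape sequences such as colors and cursor movement.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def clean_tool_output(text: str) -> str:
    """Remove terminal noise from tool output before it enters the context."""
    text = ANSI_ESCAPE.sub("", text)
    cleaned: list[str] = []
    for line in text.split("\n"):
        # A carriage return redraws the line in place; keep only the last redraw.
        line = line.rstrip("\r").split("\r")[-1]
        # Drop consecutive duplicates such as repeated retry logs.
        if cleaned and line == cleaned[-1]:
            continue
        cleaned.append(line)
    return "\n".join(cleaned)


raw = "\x1b[32mPASS\x1b[0m test_auth\n 10%\r 50%\r100%\nretrying\nretrying\ndone"
print(clean_tool_output(raw))
```

Every token removed here is one that would otherwise be re-sent on every later turn, so the savings compound with session length.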

The Numbers on a Real Session

We took a real 85-turn Claude Code session — a developer building a Stripe billing integration over a 3-hour afternoon — and ran it through three scenarios:

Scenario	Cache Hit Rate	Total Cost	vs Baseline
No caching (raw Anthropic API)	0%	$71.40	—
Anthropic 5-min TTL (Claude's built-in)	52%	$38.20	−46%
Synrouter session cache	88%	$14.80	−79%

The developer took two coffee breaks and answered three Slack messages during this session. Each interruption killed the 5-minute cache. Synrouter's session-level cache survived all of them.

What This Means for Agent Teams

If your team runs 100 agent sessions per week (reasonable for a 3-5 person engineering team using Claude Code daily), the math looks like this:

Approach	Weekly Cost	Monthly Cost	Annual Cost
Raw Anthropic	$7,140	$30,940	$371,280
Anthropic 5-min TTL	$3,820	$16,553	$198,640
Synrouter	$1,480	$6,413	$76,960

That's $121,680/year saved vs Anthropic's built-in caching, and $294,320/year saved vs no caching at all. These aren't hypothetical numbers; they're extrapolated from real session traces.
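The extrapolation can be checked with a few lines of arithmetic. This sketch assumes 52 weeks per year, derives the monthly figure as one twelfth of the annual total, and uses the per-session costs from the session comparison above. Rounding through the monthly column instead of the weekly one shifts annual totals by a few dollars.

```python
WEEKS_PER_YEAR = 52


def extrapolate(cost_per_session: float, sessions_per_week: int) -> dict[str, int]:
    """Scale a per-session cost to weekly, monthly, and annual totals (whole dollars)."""
    weekly = cost_per_session * sessions_per_week
    annual = weekly * WEEKS_PER_YEAR
    return {"weekly": round(weekly), "monthly": round(annual / 12), "annual": round(annual)}


for label, cost in [("Raw Anthropic", 71.40), ("Anthropic 5-min TTL", 38.20), ("Synrouter", 14.80)]:
    print(label, extrapolate(cost, sessions_per_week=100))
```

The annual savings are then the difference between two rows, for example the raw Anthropic total minus the Synrouter total.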

The Bottom Line

Anthropic's 5-minute cache TTL isn't a bug — it's a design choice optimized for their infrastructure, not for agent workloads. As AI agents become the dominant consumer of LLM APIs, this mismatch between cache lifetime and session lifetime will only become more expensive.

Session-lifetime caching isn't just a nice-to-have optimization. For teams running agents at scale, it's the difference between a sustainable cost model and a monthly surprise on the API bill. (For the break-even math and the cache-write premium trap, see Claude Cache TTL Trap.)

FAQ: Cache TTL and AI Agent Costs

What is Anthropic's prompt cache TTL?

5 minutes. The cache starts counting down the moment it's created, and resets to 5 minutes each time it's read. If no request references the cache breakpoint within 5 minutes, the cache is evicted and subsequent requests pay full input token price.

Does OpenAI have a prompt cache TTL?

Yes. OpenAI's prompt caching TTL is 5–10 minutes of inactivity for default in-memory caching, extending up to an hour during off-peak. GPT-4.1 and GPT-5.x support an Extended retention policy with up to 24-hour TTL, but it requires explicit cache breakpoints and manual eviction management — not a drop-in for typical agent workflows.

Can I extend Anthropic's cache TTL?

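Only within limits. The default TTL is 5 minutes, and Anthropic's API also accepts an optional 1-hour TTL ("ttl": "1h" inside cache_control) at a higher cache-write price. Neither option ties cache lifetime to the session, so a pause longer than the chosen TTL still triggers a cold start. The other workaround is sending keep-alive requests every ~4 minutes, which wastes tokens and adds complexity.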

What's a good cache TTL for AI agents?

For long-running agent sessions (30 minutes to 4+ hours), the ideal cache TTL is session-lifetime — the cache should persist for the entire duration of the agent's work, regardless of pauses. This typically means hours, not minutes.

How much does a short cache TTL actually cost?

For a 2-hour agent session using Claude Sonnet, the 5-minute cache TTL reduces effective cache hit rate to ~65%. Compared to session-lifetime caching (95%+ hit rate), this costs an extra $13.90 per session. For a team running 10 sessions/week, that's ~$556/month in avoidable costs. See the cost comparison table above for the full breakdown.

Does prompt cache TTL affect Claude Code?

Yes, directly. Claude Code sessions routinely run 30 minutes to 3+ hours. The 5-minute TTL means Claude Code API costs are significantly higher than they would be with session-lifetime caching, because every pause longer than 5 minutes forces a cold start on the full context.

Synrouter is in Early Access. If you're running agents in production and want session-level caching without building your own proxy infrastructure, click to sign up — we're onboarding users weekly.

See also: Claude Cache TTL Trap: When It Costs More Than It Saves — the break-even math and write-premium trap.