← Back to blog

Your Agent's Tool Outputs Are Wasting 60% of Your Token Budget

Synrouter Team12 min read
tool-outputstoken-optimizationagent-architecturecontext-windowcost-optimization

Last week we watched a Claude Code session where a developer asked the agent to "add rate limiting to the auth middleware." Standard task. The agent ran grep to find the right file, cat to read it, wrote the implementation, ran typecheck, ran tests.

50 turns. By turn 30, something had gone quietly wrong: the agent was carrying 60,000 tokens of accumulated tool output in its context. npm install logs from turn 5. A 4,000-line typecheck output from turn 12. ANSI escape codes and progress spinners from a dozen other commands.

The LLM was reading all of it. Every turn. And being billed for every token.

This isn't an edge case. It's every agent session you've ever run.


The Tool Output Problem, Quantified

In a chat application, tool calls are rare. In an agent session, they're the entire point. Here's the anatomy of a single turn in a Claude Code coding session:

text
1TURN 23: "Fix the type error in middleware/auth.ts"
2
3Input tokens sent to the model:
4┌────────────────────────────────────────────────────────────┐
5│ System prompt + tool schemas ~12,000 tokens │
6│ Conversation history (turns 1-22) ~45,000 tokens │
7│ of which tool outputs: ~32,000 tokens ←── │
8│ Current user message ~200 tokens │
9├────────────────────────────────────────────────────────────┤
10│ TOTAL ~57,200 tokens │
11│ TOOL OUTPUT % 56% │
12└────────────────────────────────────────────────────────────┘

Over half the context window at turn 23 is just accumulated tool output from earlier turns. None of it changed between turn 22 and turn 23. But because the API is stateless, all of it gets re-sent and re-billed.

And tool output isn't just large — it's mostly noise.


What's Actually in Your Tool Outputs

Let's look at five real tool outputs from actual agent sessions and measure what fraction is useful:

1. npm install — 99% noise

text
1$ npm install express-rate-limit
2
3npm warn deprecated inflight@1.0.6: This module is not supported...
4npm warn deprecated @humanwhale/object-scan@4.0.2: ...
5⠙ ⠹ ⠸ ⠼ ⠴ ⠦ ⠧ ⠇ ⠏ ⠋ ⠙ ⠹ ⠸ ⠼ ⠴ ⠦
6
7added 47 packages, removed 12 packages, changed 8 packages
8audited 847 packages in 8s
9
10174 packages are looking for funding
11 run `npm fund` for details
12
13found 0 vulnerabilities

Actual useful information: "added 47 packages, removed 12 packages, found 0 vulnerabilities" — about 120 characters. The rest — deprecation warnings, the animated spinner (captured as literal ⠙ ⠹ sequences), funding nag — is pure context waste. Useful fraction: ~2%.

2. typecheck (TypeScript) — 95% noise

text
1$ npx tsc --noEmit
2
3src/middleware/auth.ts:47:12 - error TS2345: Argument of type 'string | undefined'
4is not assignable to parameter of type 'string'.
5 Type 'undefined' is not assignable to type 'string'.
6
747 validateToken(req.headers.authorization)
8 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9
10src/middleware/rate-limit.ts:23:8 - error TS2304: Cannot find name 'RateLimiter'.
11
1223 const limiter = new RateLimiter(config);
13 ~~~~~~~~
14
15Found 2 errors in 2 files.
16
17Errors Files
18 2 src/middleware/auth.ts:47
19 1 src/middleware/rate-limit.ts:23

The first error is exactly what the agent needs to see. The second is noise from a file it hasn't touched yet. The "Errors Files" table at the bottom is a reformatted summary of the exact same data. Useful fraction: ~40% — better than npm, but still half waste. And a large project can produce 300+ lines of type errors that are 90% irrelevant to the current task.

3. grep in a large codebase — 80% noise

text
1$ grep -r "RateLimiter" src/
2
3src/middleware/rate-limit.ts:import { RateLimiter } from 'express-rate-limit';
4src/middleware/rate-limit.ts:const limiter = new RateLimiter(config);
5src/routes/api.ts:import { RateLimiter } from '../middleware/rate-limit';
6src/tests/rate-limit.test.ts:import { RateLimiter } from '../middleware/rate-limit';
7node_modules/express-rate-limit/dist/index.d.ts:export interface RateLimiterConfig { ... }
8node_modules/express-rate-limit/dist/index.d.ts:export class RateLimiter { ... }
9node_modules/express-rate-limit/README.md:### RateLimiter Options
10node_modules/express-rate-limit/README.md:new RateLimiter({
11... [+47 more lines from node_modules and dist files]

The agent asked for src/ — and got node_modules/ results too because grep -r is recursive. The developer knows to ignore node_modules/ hits; the LLM has to read through all of them to figure that out. Useful fraction: ~20%. The 47 node_modules lines add zero value but consume ~3,000 tokens that will persist in context for the rest of the session.

4. cat on a large file — 90% noise (but you need some of it)

text
1$ cat src/middleware/auth.ts
2
3[200 lines of middleware code]

The agent asked to read a file to understand the auth middleware. It probably needs lines 40-60 (where the token validation logic lives), maybe the imports at the top, and maybe the exports at the bottom. The 140 lines in between — database queries, error formatting, utility functions — are context that the LLM will read once and never reference again. But they don't disappear. They sit in context for 30 more turns. Useful fraction: ~25% — but the structure matters more than the line count.

5. curl / HTTP responses — 70% noise

text
1$ curl -s https://api.example.com/health
2
3HTTP/1.1 200 OK
4Date: Mon, 26 May 2026 10:15:00 GMT
5Content-Type: application/json
6Content-Length: 156
7Connection: keep-alive
8Server: nginx/1.25.3
9X-Request-ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
10X-RateLimit-Remaining: 999
11X-RateLimit-Reset: 1716718500
12Cache-Control: no-cache
13
14{"status":"ok","version":"2.4.1","uptime":142937,"db":"connected"}

The JSON body is 72 characters. The headers are 380. And the headers contain a timestamp that will be stale in 10 seconds, a request ID that will never be referenced, and rate limit counters that don't matter. Useful fraction: ~20%.


The Cumulative Effect: A 50-Turn Session

Here's what these tool outputs add up to in a real session. We instrumented 15 Claude Code sessions (median 47 turns) and measured tool output accumulation:

text
1TOOL OUTPUT ACCUMULATION OVER A TYPICAL SESSION
2
3Turn Tool output accumulated Useful portion Waste
4──── ──────────────────────── ───────────── ─────
5 5 8,200 tokens 2,100 6,100
610 18,500 tokens 4,800 13,700
720 41,000 tokens 9,200 31,800
830 62,000 tokens 12,400 49,600
940 84,000 tokens 15,100 68,900
1047 96,000 tokens 17,300 78,700
11
12By turn 47: 82% of accumulated tool output is waste.
13This waste is re-sent on EVERY subsequent turn.

96,000 tokens of tool output. 78,700 of them are noise. And every one of those tokens gets re-billed on turns 48, 49, and 50 — along with all the other context. With Claude Sonnet at $3/$15 per million input/output tokens, those 78,700 tokens of waste cost roughly $0.24 every turn they're in context. Over the last 20 turns of the session, that's $4.80 in pure waste — from tool outputs alone.


Why Nobody Fixes This (It's Harder Than It Looks)

"Just strip ANSI codes" sounds easy. But real tool output is messy in ways that generalized parsing can't handle:

Format diversity. Every CLI tool formats its output differently. npm uses spinners. tsc uses structured error tables. curl mixes headers and body. git uses color diffs. docker uses progress bars. A regex that works for npm install breaks on docker build.

Context sensitivity. Sometimes the ANSI codes are the information. A git diff output loses all meaning if you strip color codes without preserving the +/- markers. Sometimes an npm warning about a deprecated package is exactly what the agent needs to see.

The "current turn" trap. You can't just strip everything. The tool output from the current turn is what the agent is actively reasoning about. Truncate it, and you amputate the agent's working memory. Only historical tool output — data from turns 1-46 while the agent is working on turn 47 — is safe to compress.

Every framework does it differently. Claude Code wraps tool outputs in Anthropic-format tool_result content blocks. Codex CLI uses OpenAI-format tool role messages. Cursor does its own thing. Each format requires different parsing. Most teams give up and accept the waste.

The result: almost nobody compresses tool outputs. They just let them pile up, turn after turn, paying the full token price for data the model read 20 turns ago and will never need again.


The Head+Tail Strategy: Keeping What Matters, Dropping What Doesn't

Synrouter takes a different approach. Instead of trying to understand the content of every tool output (which requires language-specific parsers and breaks on edge cases), we apply a structural rule that works across every format:

text
1THE HEAD+TAIL STRATEGY
2
3Raw tool output (12,000 tokens):
4┌────────────────────────────────────────────────────────────┐
5│ [HEAD — first 1,500 tokens] │
6│ • Usually contains: command being run, early output, │
7│ initial context, error locations │
8├────────────────────────────────────────────────────────────┤
9│ [... synrouter truncated 9,000 tokens ...] │ ← Sentry marker
10├────────────────────────────────────────────────────────────┤
11│ [TAIL — last 500 tokens] │
12│ • Usually contains: final results, exit codes, error │
13│ summaries, "Found 2 errors" type conclusions │
14└────────────────────────────────────────────────────────────┘
15
16Kept: 2,000 tokens (the LLM-readable portions)
17Dropped: 10,000 tokens (the middle, which is usually repetitive noise)
18Savings: 83%

Here's why this works. Most CLI tool output follows a predictable structure:

  • The head tells you what's happening: the command that was run, the first few results, the context. For tsc, it's the first 2-3 errors with their file locations. For npm install, it's the package additions and removals.
  • The tail tells you the outcome: the summary, the exit code, the final line. For tsc, it's "Found 2 errors." For git, it's the final status line.
  • The middle is repetition. For grep, it's the 47th through 100th matches — all formatted identically. For npm, it's the spinner animation and funding messages. For large files, it's the middle 80% of the file that the agent doesn't need to re-read.

The truncation marker is explicit and LLM-readable:

[... synrouter truncated 9,000 tokens ...]

This tells the model: "data was here, it's been removed, you don't need it." Models trained on large contexts handle these markers naturally — they treat them the same way you'd treat an ellipsis in a quoted passage.


The Rules: What Gets Trimmed and When

Synrouter's tool output trimming follows three simple rules:

| Condition | Action | |-----------|--------| | Output ≤ 4,000 tokens | Untouched. Short outputs are usually entirely relevant. | | Output 4,000–20,000 tokens | Head 1,500 + truncation marker + tail 500. | | Output > 20,000 tokens | Same head+tail trimming, plus truncated_tokens metadata in the API response headers so you can track exact savings. |

And one critical safety rule: the most recent user message is never touched. If the agent just ran cat auth.ts and you're asking it to fix line 47, that full output stays intact. Only historical tool outputs — messages from earlier turns that are now just context — get trimmed.

This works across both Anthropic-format (tool_result content blocks) and OpenAI-format (tool role messages). Your agent framework doesn't need to know it's happening.


Real Numbers: A Session With and Without Trimming

We ran the same Claude Code session twice — a developer building a payment webhook handler over 55 turns. Once through the raw Anthropic API, once through Synrouter with tool output trimming enabled.

text
1SESSION: "Add Stripe webhook handler to the billing API"
255 turns, Claude Sonnet 4.6
3
4 Raw API With Trimming Savings
5 ──────── ───────────── ───────
6Total input tokens 2,840,000 1,620,000 43%
7Tool output tokens 1,470,000 438,000 70%
8Non-tool context 1,370,000 1,182,000 14%
9Total API cost $44.20 $26.80 39%
10
11Savings from tool trimming alone: $17.40 on one session.

The tool output trimming dropped the tool output portion of the context from 1.47M tokens to 438K — a 70% reduction. The non-tool context also shrank slightly because shorter prompts mean less context window churn.


Combined With Session Caching: The Full Picture

Tool output trimming and session-lifetime caching are complementary optimizations:

text
1THE TWO OPTIMIZATION LAYERS
2
3Layer 1: Tool Output Trimming
4 Problem: Tool outputs are 80% noise, pile up over turns
5 Solution: Head+tail truncation on historical tool results
6 Impact: 50-70% reduction in tool output token volume
7
8Layer 2: Session-lifetime Caching
9 Problem: 85-95% of tokens per turn are duplicates
10 Solution: Session-scoped caching that survives pauses
11 Impact: 85-95% cache hit rate (vs 50-65% with 5-min TTL)

Together, they transform the economics:

text
1FULLY OPTIMIZED 55-TURN SESSION
2
3Scenario Cost vs Baseline
4──────── ──── ───────────
5Raw Anthropic API (no caching, no trimming) $44.20 —
6Anthropic 5-min TTL only $28.70 −35%
7+ Synrouter tool trimming $17.90 −60%
8+ Synrouter session cache $9.80 −78%

From $44.20 to $9.80. Same session, same agent, same results. 78% reduction.

The tool trimming alone saves 35% relative to Anthropic's native caching. It's not a minor optimization — it's a third of your API bill that you're currently paying for characters the model doesn't need to read.


What About OpenAI?

Everything above uses Anthropic examples because Claude Code is the most transparent agent framework to instrument. But the same problem exists — and is arguably worse — with OpenAI-compatible agents.

OpenAI's API uses a different format for tool results (the tool role), but the content is identical: npm install output is npm install output regardless of which JSON wrapper it's in. Codex CLI, Cursor, and any agent built on the OpenAI SDK all carry the same accumulated tool noise in their context windows.

Synrouter's trimming works on both formats. The strategy is identical; only the JSON path to the tool output content differs.


This Isn't Compression. It's De-Noising.

There's an important philosophical difference between tool output trimming and general-purpose text compression. Compression algorithms try to reduce the byte count of arbitrary text while preserving the ability to reconstruct it perfectly. That's not what we're doing.

Tool output trimming is de-noising. We're not trying to preserve every character — we're trying to identify and discard the characters that contribute zero reasoning value. The ANSI color codes, the progress spinners, the 47th identical grep match in node_modules/. These aren't compressed; they're removed, because they were never useful in the first place.

The goal isn't "make the context smaller." The goal is "make the context denser with signal." A 2,000-token trimmed tool output that contains the error location and exit code is more useful to the LLM than a 12,000-token raw output where the signal is buried in noise.


Getting Started

Tool output trimming is transparent. You don't configure it, tune it, or write rules for it. It happens server-side for every request that passes through Synrouter.

bash
1# Your agent code — completely unchanged
2# Claude Code:
3export ANTHROPIC_BASE_URL="https://synrouter.ai/api/anthropic"
4
5# Codex CLI / any OpenAI-compatible agent:
6export OPENAI_BASE_URL="https://synrouter.ai/api/v1"
7export OPENAI_API_KEY="sk-sr-..."

Your agent sends tool calls. Synrouter forwards them. When the results come back full of ANSI codes and duplicate log lines, Synrouter trims the historical ones before they enter your context window — and before you're billed for them on every subsequent turn.

If you want to see exactly what's being trimmed, every Synrouter API response includes x-synrouter-trimmed-tokens in the headers, showing the token count removed from that request's historical tool outputs. It's fully observable. You can audit every byte.

Synrouter is in Early Access. Sign up to get your API key. First $5 in credits are free, no credit card required.


Read next: The 5-Minute TTL: How Anthropic's Prompt Cache Quietly Broke Long-Running Agents

Read next: How to Cut Claude Code API Costs by 85%