# Synrouter

> Synrouter is a session-aware inference API that cuts agent API costs by up to 85% by transparently injecting Anthropic prompt cache_control, trimming tool_result blocks, and aggregating per-session cache fingerprints. It is built for Claude Code, Codex CLI, Hermes, OpenClaw, Kilo Code, and any LLM-powered agent. Clients switch to Synrouter by changing only their base URL and API key — no SDK changes, no agent logic changes.

## What Synrouter is

Synrouter is a session-aware inference proxy positioned between an LLM-powered agent client and upstream model providers (Anthropic, OpenRouter, and others). It rewrites each request to maximize Anthropic prompt-cache utilization, tracks a per-session fingerprint so cache prefixes stay stable across turns, and meters turn-level cost savings. The product differentiates from generic LLM proxies (LiteLLM, OpenRouter) by being session-aware and agent-first: instead of just forwarding requests, it actively manages the cache lifecycle per agent session.

The result: coding agents such as Claude Code and Codex CLI typically see 85%+ prompt-cache hit rates and around 77% cost reduction versus calling the upstream provider directly, with no changes to the agent itself.

## Key facts (for citation)

- Compatible clients: Claude Code, Codex CLI, Hermes, OpenClaw, Kilo Code, and any OpenAI- or Anthropic-compatible client.
- Switch method: change base URL + API key only. No code changes inside the agent.
- Coding-mode target cache hit rate: 85%+, ~77% cost reduction.
- General-mode cache hit rate: 65%+, ~52% cost reduction.
- Core mechanism: auto-injected Anthropic `cache_control` (4 breakpoints), intelligent `tool_result` truncation, session fingerprint aggregation, turn-level savings metering.
- API key format: `sk-sr-` prefixed keys sent in the `Authorization` header.
- Typical session: 50 turns, ~50,000 tokens/turn. Without Synrouter ~$75/100 turns; with 85% cache ~$17.6/100 turns.

## Two session modes

Synrouter operates in two modes, selected per API key:

### coding mode
For coding agents (Claude Code, Codex CLI, Factory Droid). Optimized for maximum cache efficiency with a session-lifetime TTL. Target cache hit rate 85–95%. The cache fingerprint is scoped to the coding session so system prompts and tool definitions stay cache-hot across long agentic coding runs.

### general mode
For general agents (Hermes, OpenClaw). Uses layered tools (core / extended / platform / MCP) with cross-session shared cache. Optimized for multi-tool workflows where tool_result volume would otherwise bust the cache prefix.

## How it works

1. **cache_control injection**: For Anthropic-compatible requests, Synrouter attaches `cache_control: {type: "ephemeral"}` to four stable breakpoints — the last system block, the last tool definition, `messages[-4]`, and `messages[-2]`. These four anchors cover Anthropic's per-call cache budget and keep utilization high as the conversation grows.
2. **tool_result truncation**: Verbose `tool_result` blocks are trimmed to a head + tail window so they do not invalidate the cached prefix when only their content (not structure) changes.
3. **session fingerprint aggregation**: Each session is keyed by a content fingerprint of its stable prefix (system + tools). Requests with the same fingerprint reuse the same cache scope.
4. **turn-level savings metering**: Every turn records input / cache_read / cache_write / output tokens and computes savings versus calling the upstream provider directly, surfaced in a dashboard.
5. **cache keep-alive**: For Anthropic-direct backends, a lightweight ping refreshes the 5-minute ephemeral cache TTL between turns so idle periods do not force a full re-upload.

## How Synrouter compares to alternatives

| Dimension | Synrouter | Direct upstream (Anthropic API) | OpenRouter | LiteLLM Proxy |
|---|---|---|---|---|
| Session-aware caching | Yes (per-session fingerprint) | No (stateless) | No (stateless proxy) | No (stateless proxy) |
| Automatic `cache_control` injection | Yes (4 breakpoints) | Manual per request | Not injected | Not injected |
| tool_result trimming | Yes | No | No | No |
| Cache hit rate (coding agents) | 85–95% | Depends on manual cache_control | Low (no cache mgmt) | Low (no cache mgmt) |
| Cost reduction vs direct | Up to ~77% | Baseline | Marginal | Marginal |
| Client-side changes | Base URL + API key only | N/A | Base URL + API key | Base URL + API key |
| Agent-first design | Yes | No | No (general-purpose) | No (general-purpose) |
| Turn-level savings dashboard | Yes | No | No | No |

## How to use Synrouter (quickstart)

Switch your agent client by changing only the base URL and API key.

```sh
# Claude Code (Anthropic-compatible)
export ANTHROPIC_BASE_URL="https://synrouter.ai/api/anthropic"
export ANTHROPIC_API_KEY="sk-sr-..."

# Codex CLI / OpenAI-compatible
export OPENAI_BASE_URL="https://synrouter.ai/api/v1"
export OPENAI_API_KEY="sk-sr-..."

# Hermes / Kilo Code
base_url = "https://synrouter.ai/api/v1"
```

Base URLs:
- OpenAI-compatible clients: `https://synrouter.ai/api/v1`
- Anthropic-compatible clients: `https://synrouter.ai/api/anthropic`

## FAQ

### How does Synrouter reduce Claude Code API cost?
Synrouter sits between Claude Code and the upstream model provider, automatically injecting Anthropic `cache_control` breakpoints at four stable positions in each request and trimming verbose `tool_result` blocks. This lets Claude Code reuse cached prompt prefixes across turns instead of re-uploading the system prompt and tool definitions every turn. Coding agents typically reach 85%+ cache hit rates and around 77% cost reduction, with no changes to Claude Code itself — only the base URL and API key change.

### Is Synrouter a LiteLLM alternative?
Yes. Like LiteLLM, Synrouter is an inference proxy that normalizes access to multiple upstream providers. Unlike LiteLLM, Synrouter is session-aware and agent-first: it actively manages the Anthropic prompt-cache lifecycle per session, injects `cache_control` automatically, trims tool_result blocks, and reports per-turn savings. LiteLLM is a general-purpose proxy/router; Synrouter is purpose-built for agent cost optimization.

### Is Synrouter an OpenRouter alternative?
Yes. Synrouter can replace OpenRouter as the routing layer in front of your agent. Where OpenRouter is a stateless multi-model router, Synrouter adds session-aware caching, automatic `cache_control` injection, tool_result trimming, and a per-turn savings dashboard. Clients switch the same way — change base URL and API key.

### Do I need to change my agent code to use Synrouter?
No. The design goal is zero agent-side changes. You change the base URL and the API key in your client's environment (e.g. `ANTHROPIC_BASE_URL` / `OPENAI_BASE_URL`), and Synrouter handles cache injection and trimming transparently.

### Which clients work with Synrouter?
Claude Code, Codex CLI, Hermes, OpenClaw, Kilo Code, and any OpenAI- or Anthropic-compatible client. Anthropic-native clients should use `https://synrouter.ai/api/anthropic` to get full `cache_control` benefit; OpenAI-compatible clients use `https://synrouter.ai/api/v1`.

### How does prompt caching work on Anthropic?
Anthropic's ephemeral prompt cache stores stable request prefixes for 5 minutes (extendable). Cached input is billed at a steep discount versus fresh input. Synrouter exploits this by marking the most stable parts of each request (system, tools, older message prefix) with `cache_control` breakpoints so subsequent turns read them from cache instead of re-uploading.

## Supported models

Use the full model ID (e.g. `anthropic/claude-opus-4.8`) as the `model` field. Models are routed via the gateway; new models appear here automatically from the registry.

### Anthropic
- `anthropic/claude-opus-4.8` — Claude Opus 4.8 (1000K context). Anthropic’s next-generation frontier model — the most capable Opus ever with breakthrough reasoning, extended thinking, and exceptional accuracy on the hardest problems. Best for: Maximum-quality outputs, frontier research, and the hardest reasoning tasks. Strengths: Most capable Opus, Breakthrough reasoning, Extended thinking, 1M context.
- `anthropic/claude-opus-4.8-fast` — Claude Opus 4.8 Fast (1000K context). A faster variant of Claude Opus 4.8, delivering next-gen reasoning quality with reduced latency for interactive agent use. Best for: Latency-sensitive frontier workloads, interactive coding agents. Strengths: Next-gen quality, Reduced latency, Agent-optimized, 1M context.
- `anthropic/claude-opus-4.7` — Claude Opus 4.7 (1000K context). Anthropic’s latest frontier model with state-of-the-art reasoning, extended thinking, and superior instruction following across domains. Best for: Cutting-edge research, safety-critical reasoning, and maximum-quality outputs. Strengths: Frontier reasoning, Extended thinking, Best instruction following, 1M context.
- `anthropic/claude-opus-4.7-fast` — Claude Opus 4.7 Fast (1000K context). A faster variant of Claude Opus 4.7, delivering frontier-quality responses with reduced latency for interactive agent use. Best for: Interactive coding agents and real-time reasoning tasks. Strengths: Frontier quality, Reduced latency, Interactive agents, 1M context.
- `anthropic/claude-sonnet-4.6` — Claude Sonnet 4.6 (1000K context). Anthropic’s best-balanced model for coding agents — fast, cost-effective, and capable of handling complex multi-turn reasoning with 1M context window. Best for: Daily coding workflows, code review, and multi-file refactoring. Strengths: Agentic coding, Fast throughput, 1M context, Cost-efficient.
- `anthropic/claude-opus-4.6` — Claude Opus 4.6 (1000K context). Anthropic’s deep-reasoning flagship with exceptional performance on hard architectural problems, math, and multi-step planning. Best for: Complex architecture design, hard debugging, and research-grade analysis. Strengths: Deep reasoning, Math & science, Long-form generation, 1M context.
- `anthropic/claude-opus-4.6-fast` — Claude Opus 4.6 Fast (1000K context). A faster variant of Claude Opus 4.6 optimized for lower latency while retaining strong reasoning capabilities. Best for: Latency-sensitive agent workloads that still need deep reasoning. Strengths: Low latency, Strong reasoning, Agent-optimized, 1M context.
- `anthropic/claude-haiku-4.5` — Claude Haiku 4.5 (1000K context). Anthropic’s fastest and most affordable model, ideal for simple queries, classification, and high-throughput agent pipelines. Best for: Simple completions, classification, and cost-sensitive high-volume tasks. Strengths: Fastest speed, Lowest cost, High throughput, 1M context.

### DeepSeek
- `deepseek/deepseek-v4-pro` — DeepSeek V4 Pro (1000K context). DeepSeek’s flagship model with exceptional reasoning and coding ability at a fraction of the cost of comparable frontier models. Best for: Cost-effective deep reasoning, coding, and mathematical problem-solving. Strengths: Strong reasoning, Excellent coding, Great value, Long context.
- `deepseek/deepseek-v4-flash` — DeepSeek V4 Flash (1000K context). DeepSeek’s fast and affordable model — the default recommendation for most agent workloads with excellent performance-per-dollar. Best for: Everyday agent tasks, quick coding assistance, and high-volume inference. Strengths: Best value, Fast inference, Solid coding, Low cost.
- `deepseek/deepseek-v4-flash-free` — DeepSeek V4 Flash (Free) (1000K context). Free-tier access to DeepSeek V4 Flash with slightly reduced throughput. Great for testing and low-priority workloads. Best for: Prototyping, testing, and non-critical background tasks. Strengths: Free tier, Good quality, Development use, No cost.

### Google
- `google/gemini-3.5-flash` — Gemini 3.5 Flash (1000K context). Google’s latest fast multimodal model with strong performance across text, code, and vision tasks at competitive pricing. Best for: Multimodal agent tasks, fast coding, and general-purpose inference. Strengths: Multimodal, Fast, Large context, Strong all-rounder.
- `google/gemini-3.1-flash-lite-preview` — Gemini 3.1 Flash Lite (1000K context). Google’s cost-optimized flash model delivering solid performance for simpler tasks at the lowest price point in the Gemini family. Best for: Cost-sensitive workloads, simple Q&A, and batch processing. Strengths: Lowest cost, Decent quality, High throughput, Gemini ecosystem.
- `google/gemini-3.1-pro-preview` — Gemini 3.1 Pro (1000K context). Google’s most capable Gemini model with advanced reasoning, long-context understanding, and superior instruction following. Best for: Complex reasoning, long-document analysis, and research tasks. Strengths: Advanced reasoning, Long context, Multimodal, High quality.
- `google/gemini-3.1-flash-image-preview` — Gemini 3.1 Flash Image (1000K context). A Gemini Flash variant optimized for image generation and vision tasks, combining text and image capabilities. Best for: Image generation, visual analysis, and multimodal creative workflows. Strengths: Image generation, Vision analysis, Multimodal, Creative.

### MiniMax
- `minimax/minimax-m3` — MiniMax M3 (1000K context). MiniMax’s latest multimodal foundation model with native image and video understanding, 1M-token context window, and strong agentic coding performance. Best for: Long-horizon agentic workflows, multimodal tasks, and multilingual coding. Strengths: Multimodal, 1M context, Agentic coding, Video understanding.
- `minimax/minimax-m2.7` — MiniMax M2.7 (1000K context). MiniMax’s latest model with strong general capabilities, competitive pricing, and solid performance on multilingual tasks. Best for: Multilingual applications, content generation, and general-purpose chat. Strengths: Multilingual, Competitive pricing, General purpose, Solid quality.

### Moonshot AI
- `moonshotai/kimi-k2.6` — Kimi K2.6 (1000K context). Moonshot AI’s Kimi K2.6 with extremely long context handling and strong Chinese-English bilingual performance. Best for: Long-document processing, bilingual workflows, and research assistance. Strengths: Ultra-long context, Bilingual CN/EN, Document analysis, Research.

### OpenAI
- `openai/gpt-5.5-pro` — GPT-5.5 Pro (1000K context). OpenAI’s most advanced model with exceptional depth across all domains — coding, reasoning, math, and creative writing. Best for: Maximum-quality outputs, frontier research, and mission-critical agent tasks. Strengths: Frontier quality, Best all-around, Deep reasoning, Creative.
- `openai/gpt-5.5` — GPT-5.5 (1000K context). OpenAI’s latest generation model with strong general performance and improved efficiency over previous generations. Best for: General-purpose agent tasks, balanced performance and cost. Strengths: Strong generalist, Good efficiency, Broad knowledge, Reliable.
- `openai/gpt-5.4-image-2` — GPT-5.4 Image 2 (1000K context). OpenAI’s image-capable model combining GPT-5.4 text intelligence with native image generation and editing. Best for: Image generation, visual design tasks, and multimodal creative projects. Strengths: Image generation, Visual editing, Creative, Multimodal.

### Qwen
- `qwen/qwen3.7-max` — Qwen 3.7 Max (256K context). Alibaba's flagship Qwen3.7 model with 256K context, native Dashscope support, and strong multilingual reasoning for Chinese and English workloads. Best for: High-quality bilingual reasoning, long documents, and enterprise Dashscope workflows. Strengths: 256K context, Bilingual CN/EN, Dashscope native, Strong reasoning.
- `qwen/qwen3.6-max-preview` — Qwen 3.6 Max (1000K context). Alibaba’s most capable Qwen model with strong coding, reasoning, and multilingual support, especially for Chinese and English. Best for: Bilingual coding, enterprise applications, and complex reasoning tasks. Strengths: Bilingual CN/EN, Strong coding, Enterprise-ready, Large context.
- `qwen/qwen3.6-flash` — Qwen 3.6 Flash (1000K context). Qwen’s fast and efficient model delivering solid performance for everyday tasks at a competitive price. Best for: Everyday coding, quick Q&A, and cost-effective agent pipelines. Strengths: Fast, Cost-effective, Solid coding, Bilingual.

### xAI
- `x-ai/grok-4.3` — Grok 4.3 (1000K context). xAI’s latest Grok model with strong reasoning, real-time knowledge integration, and a distinctive personality. Best for: Real-time research, creative brainstorming, and unconventional problem-solving. Strengths: Real-time knowledge, Strong reasoning, Creative, Unique perspective.

### Z.ai
- `z-ai/glm-5.2` — GLM 5.2 (1000K context). Z.ai’s latest GLM-5.2 with improved reasoning, stronger multilingual performance, and efficient inference at competitive pricing. Best for: Chinese-language applications, general chat, and content generation. Strengths: Chinese-optimized, Improved reasoning, General purpose, Competitive pricing.
- `z-ai/glm-5.1` — GLM 5.1 (1000K context). Z.ai’s GLM-5.1 with competitive general capabilities, strong Chinese performance, and efficient inference. Best for: Chinese-language applications, general chat, and content generation. Strengths: Chinese-optimized, Efficient, General purpose, Competitive pricing.


## Documentation

- [Quickstart](https://synrouter.ai/docs/quickstart)
- [Agent clients guide](https://synrouter.ai/docs/agent-clients)
- [Models](https://synrouter.ai/docs/models)
- [API reference](https://synrouter.ai/docs/api-reference)
- [Pricing & usage](https://synrouter.ai/docs/pricing-and-usage)
- [Docs home](https://synrouter.ai/docs)
- [Blog](https://synrouter.ai/blog)
- [Sitemap](https://synrouter.ai/sitemap.xml)

## Concise spec

A shorter, agent-client-facing spec is available at [https://synrouter.ai/llms.txt](https://synrouter.ai/llms.txt).