Understanding Token Mechanics
While humans read in words, LLMs process text in statistical chunks called tokens. A token is not necessarily a whole word: it is a frequent character sequence learned during model training, which can be a common whole word, a subword, a prefix, or even a single character.
The standard estimate for English text: 1,000 tokens ≈ 750 words. But density varies significantly by input type:
- Prose: ~0.75 ratio (closest to the standard estimate)
- Code: Extremely token-dense. Technical syntax, special characters, and indentation can make short functions cost more than long paragraphs.
- Non-English languages: text in Cyrillic, Japanese, or Arabic script often costs 2–3x more tokens per word, because tokenizer vocabularies are trained predominantly on English.
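The 0.75 ratio makes for a quick cost estimator. The sketch below is a rough heuristic only, useful for back-of-envelope budgeting; an actual tokenizer (e.g. OpenAI's tiktoken library) gives exact counts, and the price used here is an assumed figure, not any provider's real rate.

```python
# Rough token estimator based on the ~0.75 words-per-token ratio for
# English prose. A real tokenizer gives exact counts; this heuristic is
# only for quick cost estimates and will undercount for code or
# non-English text.

def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Approximate token count from whitespace-delimited word count."""
    word_count = len(text.split())
    return round(word_count / words_per_token)

def estimate_cost(text: str, price_per_million: float) -> float:
    """Approximate input cost in dollars at an assumed $/1M-token price."""
    return estimate_tokens(text) * price_per_million / 1_000_000
```

750 words of prose comes out to roughly 1,000 tokens, matching the standard estimate above.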
1. Prompt Compression: Be Direct
LLMs don’t need politeness. Replace lengthy system prompts with direct, imperative instructions. Use bullet points instead of paragraphs. Every token removed from your system prompt saves money on every API call.
Before: “Hello! I’m a helpful assistant. When the user asks a question, please provide a detailed and thoughtful answer...”
After: “Answer concisely. Cite sources. Use markdown.”
Saves 50–80% on system prompt tokens.
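One low-tech way to enforce this is a lint-style pass over your system prompt that flags filler before it costs tokens on every call. The sketch below is hypothetical and the phrase list is illustrative, not exhaustive.

```python
# Hypothetical filler check for system prompts. Flags common politeness
# and padding phrases so they can be cut; extend the list for your app.

FILLER_PHRASES = [
    "hello", "i'm a helpful assistant", "please", "feel free to",
    "detailed and thoughtful", "don't hesitate",
]

def find_filler(prompt: str) -> list[str]:
    """Return the filler phrases present in a prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [p for p in FILLER_PHRASES if p in lowered]

verbose = ("Hello! I'm a helpful assistant. When the user asks a question, "
           "please provide a detailed and thoughtful answer.")
terse = "Answer concisely. Cite sources. Use markdown."
```

Running the check on the before/after prompts above: the verbose version trips several flags, the terse one trips none.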
2. Whitespace & Formatting: Trim the Fat
Extra spaces, newlines, and “pretty-printed” JSON add up. While modern tokenizers merge some whitespace runs into single tokens, removing unnecessary formatting can still save 5–15% on input costs.
{"key": "value", "another": "data"} → {"key":"value","another":"data"}
30% fewer tokens for dense JSON payloads.
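In Python, this is a one-argument change to the standard library's JSON serializer: `separators=(",", ":")` drops the default space after each delimiter.

```python
import json

payload = {"key": "value", "another": "data"}

# indent=2 is what most "pretty" debug output looks like; compact
# separators strip all inter-token whitespace before the API call.
pretty = json.dumps(payload, indent=2)
compact = json.dumps(payload, separators=(",", ":"))

# compact == '{"key":"value","another":"data"}'
```

Minify right before the request so logs and local debugging can keep the readable form.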
3. Stop Sequences: Prevent Rambling
Use stop sequences (e.g., “\n\n”, “</answer>”, “[END]”) to tell the model when to stop. This prevents unnecessary token generation and directly reduces output costs.
Without stop sequence: model generates full response + “Is there anything else I can help with?” + trailing newlines.
With stop sequence (“\nUser:”): model stops exactly at the answer boundary.
Can cut output tokens by 20–40%.
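The mechanics can be sketched client-side: a stop sequence simply truncates generation at the first match. In practice the provider enforces this server-side (for example via the `stop` parameter in OpenAI-style chat completion requests), so the trailing tokens are never generated or billed at all.

```python
# Client-side illustration of stop-sequence behavior: output ends at the
# earliest occurrence of any stop string. Provider APIs do this during
# generation, so the cut tokens are never produced or charged.

def apply_stop(text: str, stop_sequences: list[str]) -> str:
    """Truncate text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```

With the `"\nUser:"` stop sequence from the example above, everything after the answer boundary is discarded.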
4. Context Caching: Pay Once, Reuse Many
If you repeatedly send the same large context (codebase, documentation, system instructions), enable caching. Anthropic’s prompt caching gives you a ~90% discount on re-reads; OpenAI and Google offer ~50%. For an app that sends a 10,000-token system prompt on every request, caching alone can reduce total API spend by 40–60%.
Without caching: 10,000 input tokens × 1,000 requests/day = 10M tokens/day at full price.
With caching: 10,000 tokens written once (Anthropic bills cache writes at a small premium over base input, ~1.25x) + 9,990,000 tokens read at 10% of the input price.
~90% reduction on repeated context for Anthropic models.
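The arithmetic above can be reproduced directly. The base price below is an assumed illustrative figure; the 10% read and ~1.25x write multipliers follow Anthropic's published prompt-caching pricing.

```python
# Reproduces the caching math above. BASE_PRICE is an assumed $/1M-token
# input price; the multipliers follow Anthropic's prompt-caching pricing
# (cache reads ~10% of base, cache writes ~1.25x base).

BASE_PRICE_PER_MTOK = 3.00
CACHE_READ_MULT = 0.10
CACHE_WRITE_MULT = 1.25

def daily_cost(prompt_tokens: int, requests: int, cached: bool = False) -> float:
    """Daily input cost in dollars for a repeated prompt prefix."""
    total = prompt_tokens * requests
    if not cached:
        return total * BASE_PRICE_PER_MTOK / 1e6
    write = prompt_tokens * CACHE_WRITE_MULT            # first request writes
    reads = (total - prompt_tokens) * CACHE_READ_MULT   # the rest hit cache
    return (write + reads) * BASE_PRICE_PER_MTOK / 1e6
```

At 10,000 prompt tokens and 1,000 requests/day, the cached bill is about 10% of the uncached one, consistent with the ~90% figure above.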
To use prompt caching effectively, structure your API calls so the static portion of your context (system instructions, RAG documents, few-shot examples) comes first and stays identical across requests. Dynamic content — the user’s actual message — goes last. Providers cache from the beginning of the prompt, so any change early in the sequence invalidates the cached prefix.
Cache entries have a minimum token threshold (1,024 tokens for most Anthropic models; OpenAI’s automatic prompt caching likewise requires prompts of at least 1,024 tokens). Short system prompts won’t qualify — this technique is highest-value for apps with long, stable system prompts, RAG pipelines that prepend the same documents, or multi-turn agents that replay conversation history.
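A cache-friendly request layout looks roughly like the sketch below, which follows the shape of Anthropic's Messages API with prompt caching. Treat the model name, system text, and exact field values as placeholder assumptions; the key point is the ordering and where the cache marker sits.

```python
# Sketch of a cache-friendly request payload (Anthropic Messages API
# shape; values are placeholders). Static blocks come first and carry
# the cache marker; only the final user message varies per request.

STATIC_SYSTEM = "You are a support agent for Acme Corp. ..."   # long, stable
STATIC_DOCS = "<retrieved documentation, identical across requests>"

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",          # assumed model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM + "\n\n" + STATIC_DOCS,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content goes last so it never invalidates the prefix.
        "messages": [{"role": "user", "content": user_message}],
    }
```

Because the `system` block is byte-identical across calls, every request after the first reads it from cache regardless of what the user asks.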
Putting It Together
These four techniques compound. A production application that implements all of them — compressed system prompt, minified data payloads, stop sequences on bounded outputs, and prompt caching for repeated context — can realistically reduce its per-request token cost by 60–80% compared to a naive first implementation.
The right order to implement them:
- Start with prompt compression. It costs nothing, takes an hour, and the savings apply to every single call. Treat your system prompt as code — review it for redundancy, cut filler, tighten structure.
- Add stop sequences next. If your app generates bounded outputs (classifications, structured responses, short answers), this is a one-line change that immediately cuts output token waste.
- Minify inputs before prompting. If you pass JSON, logs, or structured data into your prompts, strip whitespace before the API call. Pair this with output format instructions (“respond in compact JSON”) to tighten both sides.
- Implement caching last. It requires the most structural work (reordering your prompt construction, managing cache invalidation), but delivers the largest absolute savings once the first three are in place.
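The compounding can be modeled with a back-of-envelope calculation. Every factor below is an illustrative assumption drawn from the ranges above, not a measurement: `input_factor` bundles prompt compression, minification, and caching; `output_factor` reflects stop sequences; the 70/30 input/output cost split is also assumed.

```python
# Back-of-envelope model of how the four techniques compound. All
# factors are illustrative assumptions: input_factor bundles prompt
# compression, minification, and caching; output_factor reflects stop
# sequences; the 70/30 input/output cost split is assumed.

def remaining_cost(input_share: float = 0.70, output_share: float = 0.30,
                   input_factor: float = 0.25,
                   output_factor: float = 0.65) -> float:
    """Fraction of the naive per-request cost left after optimization."""
    return input_share * input_factor + output_share * output_factor
```

With these assumed factors, the optimized bill is roughly 37% of the naive one — a ~63% reduction, inside the 60–80% range above. Plug in your own shares and factors to see where your app lands.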
Use ModelMath’s calculator to model the impact of these changes before writing code. Adjust the cache rate slider and input/output ratio to see how your monthly cost changes under different optimization scenarios.