Understanding Token Mechanics
While humans read in words, LLMs process text in statistical chunks called tokens. A token is not necessarily a whole word: it is a frequent character sequence learned during model training, which can be a common whole word, a subword, a prefix, or even a single character.
The standard estimate for English text: 1,000 tokens ≈ 750 words. But density varies significantly by input type:
- Prose: ~0.75 ratio (closest to the standard estimate)
- Code: Extremely token-dense. Technical syntax, special characters, and indentation can make short functions cost more than long paragraphs.
- Non-English languages: text in Cyrillic, Japanese, or Arabic script often costs 2–3x more tokens per word, because tokenizer vocabularies are trained predominantly on English.
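The 0.75 ratio makes for a quick cost estimator. The sketch below is a rough heuristic only, useful for back-of-envelope budgeting; an actual tokenizer (e.g. OpenAI's tiktoken library) gives exact counts, and the price used here is an assumed figure, not any provider's real rate.

```python
# Rough token estimator based on the ~0.75 words-per-token ratio for
# English prose. A real tokenizer gives exact counts; this heuristic is
# only for quick cost estimates and will undercount for code or
# non-English text.

def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Approximate token count from whitespace-delimited word count."""
    word_count = len(text.split())
    return round(word_count / words_per_token)

def estimate_cost(text: str, price_per_million: float) -> float:
    """Approximate input cost in dollars at an assumed $/1M-token price."""
    return estimate_tokens(text) * price_per_million / 1_000_000
```

750 words of prose comes out to roughly 1,000 tokens, matching the standard estimate above.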
1. Prompt Compression: Be Direct
LLMs don’t need politeness. Replace lengthy system prompts with direct, imperative instructions. Use bullet points instead of paragraphs. Every token removed from your system prompt saves money on every API call.
Before: “Hello! I’m a helpful assistant. When the user asks a question, please provide a detailed and thoughtful answer...”
After: “Answer concisely. Cite sources. Use markdown.”
Saves 50–80% on system prompt tokens.
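One low-tech way to enforce this is a lint-style pass over your system prompt that flags filler before it costs tokens on every call. The sketch below is hypothetical and the phrase list is illustrative, not exhaustive.

```python
# Hypothetical filler check for system prompts. Flags common politeness
# and padding phrases so they can be cut; extend the list for your app.

FILLER_PHRASES = [
    "hello", "i'm a helpful assistant", "please", "feel free to",
    "detailed and thoughtful", "don't hesitate",
]

def find_filler(prompt: str) -> list[str]:
    """Return the filler phrases present in a prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [p for p in FILLER_PHRASES if p in lowered]

verbose = ("Hello! I'm a helpful assistant. When the user asks a question, "
           "please provide a detailed and thoughtful answer.")
terse = "Answer concisely. Cite sources. Use markdown."
```

Running the check on the before/after prompts above: the verbose version trips several flags, the terse one trips none.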
2. Whitespace & Formatting: Trim the Fat
Extra spaces, newlines, and “pretty-printed” JSON add up. While modern tokenizers merge some whitespace runs into single tokens, removing unnecessary formatting can still save 5–15% on input costs.
{"key": "value", "another": "data"} → {"key":"value","another":"data"}
30% fewer tokens for dense JSON payloads.
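In Python, this is a one-argument change to the standard library's JSON serializer: `separators=(",", ":")` drops the default space after each delimiter.

```python
import json

payload = {"key": "value", "another": "data"}

# indent=2 is what most "pretty" debug output looks like; compact
# separators strip all inter-token whitespace before the API call.
pretty = json.dumps(payload, indent=2)
compact = json.dumps(payload, separators=(",", ":"))

# compact == '{"key":"value","another":"data"}'
```

Minify right before the request so logs and local debugging can keep the readable form.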
3. Stop Sequences: Prevent Rambling
Use stop sequences (e.g., “\n\n”, “</answer>”, “[END]”) to tell the model when to stop. This prevents unnecessary token generation and directly reduces output costs.
Without stop sequence: model generates full response + “Is there anything else I can help with?” + trailing newlines.
With stop sequence (“\nUser:”): model stops exactly at the answer boundary.
Can cut output tokens by 20–40%.
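The mechanics can be sketched client-side: a stop sequence simply truncates generation at the first match. In practice the provider enforces this server-side (for example via the `stop` parameter in OpenAI-style chat completion requests), so the trailing tokens are never generated or billed at all.

```python
# Client-side illustration of stop-sequence behavior: output ends at the
# earliest occurrence of any stop string. Provider APIs do this during
# generation, so the cut tokens are never produced or charged.

def apply_stop(text: str, stop_sequences: list[str]) -> str:
    """Truncate text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```

With the `"\nUser:"` stop sequence from the example above, everything after the answer boundary is discarded.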
4. Context Caching: Pay Once, Reuse Many
If you repeatedly send the same large context (codebase, documentation, system instructions), enable caching. Anthropic’s prompt caching gives you a ~90% discount on re-reads; OpenAI and Google offer ~50%. For an app that sends a 10,000-token system prompt on every request, caching alone can reduce total API spend by 40–60%.
Without caching: 10,000 input tokens × 1,000 requests/day = 10M tokens/day at full price.
With caching: 10,000 tokens written once (Anthropic bills cache writes at a small premium over base input, ~1.25x) + 9,990,000 tokens read at 10% of the input price.
~90% reduction on repeated context for Anthropic models.
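The arithmetic above can be reproduced directly. The base price below is an assumed illustrative figure; the 10% read and ~1.25x write multipliers follow Anthropic's published prompt-caching pricing.

```python
# Reproduces the caching math above. BASE_PRICE is an assumed $/1M-token
# input price; the multipliers follow Anthropic's prompt-caching pricing
# (cache reads ~10% of base, cache writes ~1.25x base).

BASE_PRICE_PER_MTOK = 3.00
CACHE_READ_MULT = 0.10
CACHE_WRITE_MULT = 1.25

def daily_cost(prompt_tokens: int, requests: int, cached: bool = False) -> float:
    """Daily input cost in dollars for a repeated prompt prefix."""
    total = prompt_tokens * requests
    if not cached:
        return total * BASE_PRICE_PER_MTOK / 1e6
    write = prompt_tokens * CACHE_WRITE_MULT            # first request writes
    reads = (total - prompt_tokens) * CACHE_READ_MULT   # the rest hit cache
    return (write + reads) * BASE_PRICE_PER_MTOK / 1e6
```

At 10,000 prompt tokens and 1,000 requests/day, the cached bill is about 10% of the uncached one, consistent with the ~90% figure above.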
To use prompt caching effectively, structure your API calls so the static portion of your context (system instructions, RAG documents, few-shot examples) comes first and stays identical across requests. Dynamic content — the user’s actual message — goes last. Providers cache from the beginning of the prompt, so any change early in the sequence invalidates the cached prefix.
Cache entries have a minimum token threshold (1,024 tokens for most Anthropic models; OpenAI’s automatic prompt caching likewise requires prompts of at least 1,024 tokens). Short system prompts won’t qualify — this technique is highest-value for apps with long, stable system prompts, RAG pipelines that prepend the same documents, or multi-turn agents that replay conversation history.
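A cache-friendly request layout looks roughly like the sketch below, which follows the shape of Anthropic's Messages API with prompt caching. Treat the model name, system text, and exact field values as placeholder assumptions; the key point is the ordering and where the cache marker sits.

```python
# Sketch of a cache-friendly request payload (Anthropic Messages API
# shape; values are placeholders). Static blocks come first and carry
# the cache marker; only the final user message varies per request.

STATIC_SYSTEM = "You are a support agent for Acme Corp. ..."   # long, stable
STATIC_DOCS = "<retrieved documentation, identical across requests>"

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",          # assumed model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM + "\n\n" + STATIC_DOCS,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content goes last so it never invalidates the prefix.
        "messages": [{"role": "user", "content": user_message}],
    }
```

Because the `system` block is byte-identical across calls, every request after the first reads it from cache regardless of what the user asks.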
Putting It Together
These four techniques compound. A production application that implements all of them — compressed system prompt, minified data payloads, stop sequences on bounded outputs, and prompt caching for repeated context — can realistically reduce its per-request token cost by 60–80% compared to a naive first implementation.
The right order to implement them:
- Start with prompt compression. It costs nothing, takes an hour, and the savings apply to every single call. Treat your system prompt as code — review it for redundancy, cut filler, tighten structure.
- Add stop sequences next. If your app generates bounded outputs (classifications, structured responses, short answers), this is a one-line change that immediately cuts output token waste.
- Minify inputs before prompting. If you pass JSON, logs, or structured data into your prompts, strip whitespace before the API call. Pair this with output format instructions (“respond in compact JSON”) to tighten both sides.
- Implement caching last. It requires the most structural work (reordering your prompt construction, managing cache invalidation), but delivers the largest absolute savings once the first three are in place.
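The compounding can be modeled with a back-of-envelope calculation. Every factor below is an illustrative assumption drawn from the ranges above, not a measurement: `input_factor` bundles prompt compression, minification, and caching; `output_factor` reflects stop sequences; the 70/30 input/output cost split is also assumed.

```python
# Back-of-envelope model of how the four techniques compound. All
# factors are illustrative assumptions: input_factor bundles prompt
# compression, minification, and caching; output_factor reflects stop
# sequences; the 70/30 input/output cost split is assumed.

def remaining_cost(input_share: float = 0.70, output_share: float = 0.30,
                   input_factor: float = 0.25,
                   output_factor: float = 0.65) -> float:
    """Fraction of the naive per-request cost left after optimization."""
    return input_share * input_factor + output_share * output_factor
```

With these assumed factors, the optimized bill is roughly 37% of the naive one — a ~63% reduction, inside the 60–80% range above. Plug in your own shares and factors to see where your app lands.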
Use ModelMath’s calculator to model the impact of these changes before writing code. Adjust the cache rate slider and input/output ratio to see how your monthly cost changes under different optimization scenarios.