Context is expensive. Every prompt you send to an LLM costs money — and if you're not caching aggressively and trimming context intelligently, you're burning budget on redundant tokens.
We learned this the hard way. Six months ago, our monthly AI API bill was climbing faster than our client list. We were sending the same system prompts over and over, re-sending conversation history we'd already processed, and pumping bloated context windows into every API call. The fix wasn't switching models. It was smarter architecture.
Here's how we built a system that cut our API costs by 80% using Redux Toolkit (RTK) and a context-optimization layer called Caveman.
01. The Problem: Context is the New Compute
When you're building AI features, the obvious optimization is model pricing. Switch from GPT-4 to GPT-3.5, save money, done.
Except you're not done. The real cost isn't the model — it's the context. Every prompt you send includes:
- System prompt (sent every time)
- Conversation history (grows linearly)
- Retrieved documents (often oversized)
- Tool schemas and descriptions
A "simple" AI feature can easily send 50,000 tokens per request. At $0.01/1K tokens for GPT-4o, that's $0.50 per request. At 10,000 requests/day, that's $5,000/day — just in context overhead.
The solution is twofold: caching and context trimming.
02. Step 1: Eliminate Redundant API Calls with RTK Query
Redux Toolkit Query (RTK Query) is a data-fetching library built on top of Redux. It's designed for GraphQL and REST APIs, but here's the part most people miss: it works brilliantly as a caching layer for AI API calls too.
The key insight: many AI API calls are deterministic given the same inputs. If two users ask the same question with the same context, you shouldn't call the LLM twice.
RTK Query gives you this out of the box:
```ts
import { createApi, fetchBaseQuery } from '@reduxjs/toolkit/query/react'

type AIQueryArgs = { prompt: string; context: string }

const aiApi = createApi({
  reducerPath: 'aiApi',
  baseQuery: fetchBaseQuery({ baseUrl: '/api/' }),
  tagTypes: ['AIQuery'],
  endpoints: (builder) => ({
    queryAI: builder.query<unknown, AIQueryArgs>({
      query: ({ prompt, context }) => ({
        url: 'ai/query',
        method: 'POST',
        body: { prompt, context },
      }),
      providesTags: (result, error, { prompt }) => [{ type: 'AIQuery', id: prompt }],
    }),
  }),
})

export const { useLazyQueryAIQuery } = aiApi
```

With this setup, RTK Query automatically deduplicates identical requests: the serialized arguments (prompt plus context) become the cache key. Call the same prompt with the same context twice? One API call. Call it 100 times? Still one API call, for as long as the cached entry is retained.
For near-duplicate queries, where slight variations in the input shouldn't produce a separate cache entry, you can tune how the cache key is built (see the sketch below). The point stands: never pay for the same computation twice.
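One way to do that is RTK Query's endpoint-level `serializeQueryArgs` option, sketched below: build the cache key from the prompt plus a hash of the context, so cosmetically different but equivalent context objects map to the same entry. The `hashContext` helper here is hypothetical; any stable hash will do.

```ts
import { createApi, fetchBaseQuery } from '@reduxjs/toolkit/query/react'

// Hypothetical stable hash; swap in a real implementation (e.g. SHA-256).
const hashContext = (context: string): string =>
  String(context.length) + ':' + context.slice(0, 32)

const aiApi = createApi({
  reducerPath: 'aiApi',
  baseQuery: fetchBaseQuery({ baseUrl: '/api/' }),
  endpoints: (builder) => ({
    queryAI: builder.query<unknown, { prompt: string; context: string }>({
      query: ({ prompt, context }) => ({
        url: 'ai/query',
        method: 'POST',
        body: { prompt, context },
      }),
      // Only the prompt and a context hash participate in the cache key,
      // so equivalent-but-not-identical context objects still hit the cache.
      serializeQueryArgs: ({ queryArgs, endpointName }) =>
        `${endpointName}(${queryArgs.prompt}|${hashContext(queryArgs.context)})`,
    }),
  }),
})
```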
03. Step 2: Trim Context with Caveman
Caching handles repeated identical requests. But what about the context itself? Your prompts are still bloated even if they're unique.
Caveman is a context-optimization library that strips unnecessary tokens from your prompts before they reach the LLM. It's not a model — it's a preprocessor that:
- Deduplicates content within a context window
- Removes low-signal boilerplate (the parts that sound impressive but add nothing)
- Compresses redundant examples in few-shot prompts
- Trims conversation history to the most recent relevant exchanges
Here's how it integrates into your pipeline:
```ts
import { optimize } from 'caveman'

const systemPrompt = await optimize(
  `You are a helpful assistant for ACME Corp.
  You help customers with their orders.
  Our products are high quality.
  We care about customer satisfaction.
  You should be polite and helpful. ...` // 2,000 tokens of boilerplate
)

// Output: "You are a helpful ACME Corp assistant. Be polite and helpful."
// Savings: ~95% of tokens, same semantic meaning
```

For conversation history, Caveman uses semantic similarity to determine which prior exchanges are actually relevant to the current query. It drops the 80% of history that sounds related but isn't — the rambling early context that nobody needed.
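The library's internals aren't the point here, but the idea is easy to picture. The sketch below is not Caveman's actual code; it assumes a hypothetical `embed()` function that turns text into a vector, scores each prior exchange against the current query with cosine similarity, and keeps only the top-scoring turns in their original order:

```ts
// Minimal sketch of semantic history trimming (not Caveman's real internals).
// Assumes a hypothetical embed() that maps text to a numeric vector.
type Exchange = { role: 'user' | 'assistant'; content: string }

declare function embed(text: string): Promise<number[]>

const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

async function optimizeHistory(
  history: Exchange[],
  query: string,
  keep = 6, // keep at most this many prior exchanges
): Promise<Exchange[]> {
  const queryVec = await embed(query)
  const scored = await Promise.all(
    history.map(async (exchange, index) => ({
      exchange,
      index,
      score: cosine(queryVec, await embed(exchange.content)),
    })),
  )
  return scored
    .sort((a, b) => b.score - a.score) // most relevant first
    .slice(0, keep)
    .sort((a, b) => a.index - b.index) // restore chronological order
    .map((s) => s.exchange)
}
```

In practice you'd likely also pin the most recent exchange or two regardless of score, so the model never loses the immediate thread of the conversation.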
04. Putting It Together: The Full Pipeline
Here's the architecture we use at Apptivity for AI features that handle high request volumes:
- User query arrives → RTK Query intercepts it
- Cache check → If identical query exists in cache, return cached response (0 LLM cost)
- Context optimization → Caveman trims the system prompt, conversation history, and retrieved documents
- Token savings → The optimized context is 60-80% smaller than the raw version
- LLM call → Smaller context = lower cost per call
- Cache write → Response stored for future deduplication
```ts
// Simplified server-side handler. Per-client deduplication is handled by
// RTK Query; this in-memory cache (an illustration — swap in Redis or similar)
// catches identical requests across users.
const responseCache = new Map<string, AIResponse>()

async function queryAI(request: AIRequest): Promise<AIResponse> {
  // Step 1: Check cache
  const cacheKey = `${request.prompt}:${hash(request.context)}`
  const cached = responseCache.get(cacheKey)
  if (cached) return cached

  // Step 2: Optimize context
  const optimizedSystem = await optimize(systemPrompt)
  const optimizedHistory = await optimizeHistory(conversationHistory, request.prompt)
  const optimizedDocs = await optimizeDocs(retrievedDocuments, request.prompt)

  // Step 3: Call LLM with optimized context
  const response = await callLLM({
    system: optimizedSystem,
    history: optimizedHistory,
    docs: optimizedDocs,
    query: request.prompt,
  })

  // Step 4: Cache response for future identical requests
  responseCache.set(cacheKey, response)
  return response
}
```

05. The Numbers
Here's what we see with clients who implement this architecture:
| Metric | Before | After |
|---|---|---|
| Tokens per request | 48,000 | 9,600 |
| LLM cost per 1K requests | $480 | $96 |
| Cache hit rate | 0% | 35-60% |
| Effective cost reduction | — | ~80% |
The 80% reduction comes from two compounding effects: the cache eliminates redundant calls entirely, and the smaller context windows reduce the cost of every unique call.
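To see how the two effects compound, here's the arithmetic as a rough model, using the table's own figures (illustrative, not additional measured data):

```ts
// effectiveCost ≈ (1 - cacheHitRate) * (trimmedTokens / originalTokens) * baselineCost
const baselineCostPer1kRequests = 480 // $, from the table
const tokenRatio = 9_600 / 48_000 // trimming leaves ~20% of the tokens
const cacheHitRate = 0.35 // lower bound of the observed 35-60% range

const effectiveCostPer1kRequests =
  (1 - cacheHitRate) * tokenRatio * baselineCostPer1kRequests

console.log(effectiveCostPer1kRequests) // ≈ 62, i.e. an 80%+ reduction vs. $480
```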
06. Who This Is For
This architecture is most valuable when:
- You have high request volume (thousands of requests/day)
- Users ask similar questions (customer support, knowledge bases, document Q&A)
- You're building multi-turn conversational features
- Your AI bill is becoming a line item that gets attention in board meetings
If you're running one-off creative tasks, this is overkill. If you're scaling an AI product, the caching layer pays for itself in a week.
07. The Bottom Line
Don't optimize the model before you optimize the architecture. A cheap model with bloated context will cost more than an expensive model with smart caching and trimmed prompts.
RTK Query for deduplication. Caveman for context trimming. That's the foundation. The model is step 10, not step 1.
If you want us to walk through your current AI architecture and identify where the token waste is hiding, that's what we do at Apptivity. No pitch — just the audit.