Context is expensive. Every prompt you send to an LLM costs money — and if you're not caching aggressively and trimming context intelligently, you're burning budget on redundant tokens.
We learned this the hard way. Six months ago, our monthly AI API bill was climbing faster than our client list. We were sending the same system prompts over and over, re-sending conversation history we'd already processed, and pumping bloated context windows into every API call. The fix wasn't switching models. It was smarter architecture.
Here's how we built a system that cut our API costs by 80% using Redux Toolkit (RTK) and a context-optimization layer called Caveman.
01. The Problem: Context is the New Compute
When you're building AI features, the obvious optimization is model pricing. Switch from GPT-4 to GPT-3.5, save money, done.
Except you're not done. The real cost isn't the model — it's the context. Every prompt you send includes:
- System prompt (sent every time)
- Conversation history (grows linearly)
- Retrieved documents (often oversized)
- Tool schemas and descriptions
A "simple" AI feature can easily send 50,000 tokens per request. At $0.01/1K tokens for GPT-4o, that's $0.50 per request. At 10,000 requests/day, that's $5,000/day — just in context overhead.
The solution is twofold: caching and context trimming.
02. Step 1: Eliminate Redundant API Calls with RTK Query
Redux Toolkit Query (RTK Query) is a data-fetching library built on top of Redux. It's designed for GraphQL and REST APIs, but here's the part most people miss: it works brilliantly as a caching layer for AI API calls too.
The key insight: many AI API calls are deterministic given the same inputs. If two users ask the same question with the same context, you shouldn't call the LLM twice.
RTK Query gives you this out of the box:
```ts
import { createApi, fetchBaseQuery } from '@reduxjs/toolkit/query/react'

type AIQueryArgs = { prompt: string; context: string }

const aiApi = createApi({
  reducerPath: 'aiApi',
  baseQuery: fetchBaseQuery({ baseUrl: '/api/' }),
  tagTypes: ['AIQuery'],
  endpoints: (builder) => ({
    queryAI: builder.query<unknown, AIQueryArgs>({
      query: ({ prompt, context }) => ({
        url: 'ai/query',
        method: 'POST',
        body: { prompt, context },
      }),
      providesTags: (result, error, { prompt }) => [{ type: 'AIQuery', id: prompt }],
    }),
  }),
})

export const { useLazyQueryAIQuery } = aiApi
```

With this setup, RTK Query automatically deduplicates identical requests: the serialized arguments (prompt plus context) become the cache key. Call the same prompt with the same context twice? One API call. Call it 100 times? Still one API call, for as long as the cached entry is retained.
For near-duplicate queries, where slight variations in the input shouldn't produce a separate cache entry, you can tune how the cache key is built (see the sketch below). The point stands: never pay for the same computation twice.
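One way to do that is RTK Query's endpoint-level `serializeQueryArgs` option, sketched below: build the cache key from the prompt plus a hash of the context, so cosmetically different but equivalent context objects map to the same entry. The `hashContext` helper here is hypothetical; any stable hash will do.

```ts
import { createApi, fetchBaseQuery } from '@reduxjs/toolkit/query/react'

// Hypothetical stable hash; swap in a real implementation (e.g. SHA-256).
const hashContext = (context: string): string =>
  String(context.length) + ':' + context.slice(0, 32)

const aiApi = createApi({
  reducerPath: 'aiApi',
  baseQuery: fetchBaseQuery({ baseUrl: '/api/' }),
  endpoints: (builder) => ({
    queryAI: builder.query<unknown, { prompt: string; context: string }>({
      query: ({ prompt, context }) => ({
        url: 'ai/query',
        method: 'POST',
        body: { prompt, context },
      }),
      // Only the prompt and a context hash participate in the cache key,
      // so equivalent-but-not-identical context objects still hit the cache.
      serializeQueryArgs: ({ queryArgs, endpointName }) =>
        `${endpointName}(${queryArgs.prompt}|${hashContext(queryArgs.context)})`,
    }),
  }),
})
```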
03. Step 2: Trim Context with Caveman
Caching handles repeated identical requests. But what about the context itself? Your prompts are still bloated even if they're unique.
Caveman is a context-optimization library that strips unnecessary tokens from your prompts before they reach the LLM. It's not a model — it's a preprocessor that:
- Deduplicates content within a context window
- Removes low-signal boilerplate (the parts that sound impressive but add nothing)
- Compresses redundant examples in few-shot prompts
- Trims conversation history to the most recent relevant exchanges
Here's how it integrates into your pipeline:
```ts
import { optimize } from 'caveman'

const systemPrompt = await optimize(
  `You are a helpful assistant for ACME Corp.
  You help customers with their orders.
  Our products are high quality.
  We care about customer satisfaction.
  You should be polite and helpful. ...` // 2,000 tokens of boilerplate
)

// Output: "You are a helpful ACME Corp assistant. Be polite and helpful."
// Savings: ~95% of tokens, same semantic meaning
```

For conversation history, Caveman uses semantic similarity to determine which prior exchanges are actually relevant to the current query. It drops the 80% of history that sounds related but isn't — the rambling early context that nobody needed.
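The library's internals aren't the point here, but the idea is easy to picture. The sketch below is not Caveman's actual code; it assumes a hypothetical `embed()` function that turns text into a vector, scores each prior exchange against the current query with cosine similarity, and keeps only the top-scoring turns in their original order:

```ts
// Minimal sketch of semantic history trimming (not Caveman's real internals).
// Assumes a hypothetical embed() that maps text to a numeric vector.
type Exchange = { role: 'user' | 'assistant'; content: string }

declare function embed(text: string): Promise<number[]>

const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

async function optimizeHistory(
  history: Exchange[],
  query: string,
  keep = 6, // keep at most this many prior exchanges
): Promise<Exchange[]> {
  const queryVec = await embed(query)
  const scored = await Promise.all(
    history.map(async (exchange, index) => ({
      exchange,
      index,
      score: cosine(queryVec, await embed(exchange.content)),
    })),
  )
  return scored
    .sort((a, b) => b.score - a.score) // most relevant first
    .slice(0, keep)
    .sort((a, b) => a.index - b.index) // restore chronological order
    .map((s) => s.exchange)
}
```

In practice you'd likely also pin the most recent exchange or two regardless of score, so the model never loses the immediate thread of the conversation.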
04. Putting It Together: The Full Pipeline
Here's the architecture we use at Apptivity for AI features that handle high request volumes:
- User query arrives → RTK Query intercepts it
- Cache check → If identical query exists in cache, return cached response (0 LLM cost)
- Context optimization → Caveman trims the system prompt, conversation history, and retrieved documents
- Token savings → The optimized context is 60-80% smaller than the raw version
- LLM call → Smaller context = lower cost per call
- Cache write → Response stored for future deduplication
```ts
// Simplified server-side handler. Per-client deduplication is handled by
// RTK Query; this in-memory cache (an illustration — swap in Redis or similar)
// catches identical requests across users.
const responseCache = new Map<string, AIResponse>()

async function queryAI(request: AIRequest): Promise<AIResponse> {
  // Step 1: Check cache
  const cacheKey = `${request.prompt}:${hash(request.context)}`
  const cached = responseCache.get(cacheKey)
  if (cached) return cached

  // Step 2: Optimize context
  const optimizedSystem = await optimize(systemPrompt)
  const optimizedHistory = await optimizeHistory(conversationHistory, request.prompt)
  const optimizedDocs = await optimizeDocs(retrievedDocuments, request.prompt)

  // Step 3: Call LLM with optimized context
  const response = await callLLM({
    system: optimizedSystem,
    history: optimizedHistory,
    docs: optimizedDocs,
    query: request.prompt,
  })

  // Step 4: Cache response for future identical requests
  responseCache.set(cacheKey, response)
  return response
}
```

05. The Numbers
Here's what we see with clients who implement this architecture:
| Metric | Before | After |
|---|---|---|
| Tokens per request | 48,000 | 9,600 |
| LLM cost per 1K requests | $480 | $96 |
| Cache hit rate | 0% | 35-60% |
| Effective cost reduction | — | ~80% |
The 80% reduction comes from two compounding effects: the cache eliminates redundant calls entirely, and the smaller context windows reduce the cost of every unique call.
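To see how the two effects compound, here's the arithmetic as a rough model, using the table's own figures (illustrative, not additional measured data):

```ts
// effectiveCost ≈ (1 - cacheHitRate) * (trimmedTokens / originalTokens) * baselineCost
const baselineCostPer1kRequests = 480 // $, from the table
const tokenRatio = 9_600 / 48_000 // trimming leaves ~20% of the tokens
const cacheHitRate = 0.35 // lower bound of the observed 35-60% range

const effectiveCostPer1kRequests =
  (1 - cacheHitRate) * tokenRatio * baselineCostPer1kRequests

console.log(effectiveCostPer1kRequests) // ≈ 62, i.e. an 80%+ reduction vs. $480
```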
06. Who This Is For
This architecture is most valuable when:
- You have high request volume (thousands of requests/day)
- Users ask similar questions (customer support, knowledge bases, document Q&A)
- You're building multi-turn conversational features
- Your AI bill is becoming a line item that gets attention in board meetings
If you're running one-off creative tasks, this is overkill. If you're scaling an AI product, the caching layer pays for itself in a week.
07. The Bottom Line
Don't optimize the model before you optimize the architecture. A cheap model with bloated context will cost more than an expensive model with smart caching and trimmed prompts.
RTK Query for deduplication. Caveman for context trimming. That's the foundation. The model is step 10, not step 1.
If you want us to walk through your current AI architecture and identify where the token waste is hiding, that's what we do at Apptivity. No pitch — just the audit.