Context Budget Management: Stop Burning Tokens on Irrelevant Memories
Every token you send to an LLM costs money. With GPT-4o at $2.50 per million input tokens and Claude Sonnet at $3.00, stuffing your context window with every memory your agent has ever stored is like heating your house by burning cash.
Most AI agents do exactly this. They retrieve a flat list of memories — say, the top 20 by semantic similarity — and dump them all into the system prompt. No prioritization. No budgeting. No awareness of whether those 20 memories are worth the 4,000 tokens they cost.
0Latency solves this with tiered context loading: a three-level system (L0, L1, L2) that loads the right memories at the right cost, keeping your context window lean and your responses sharp.
The Cost of Naive Memory Loading
Let's do the math. Consider an AI assistant with 5,000 stored memories, handling 10,000 queries per day. With a naive "top-20" retrieval approach:
❌ Naive Approach: Top-20 flat retrieval
- Average memory size: 200 tokens
- Memories loaded per query: 20
- Tokens per query from memories: 4,000
- Daily memory tokens: 4,000 × 10,000 = 40M tokens
- Monthly cost (GPT-4o input): 40M × 30 × $2.50/1M = $3,000/month
But here's the problem: in our analysis of real production workloads, only 30-40% of retrieved memories are actually relevant to the query. The rest are noise: semantically adjacent but not useful. On a $3,000 monthly bill, that's $1,800 to $2,100 in wasted tokens every month.
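The arithmetic above is easy to reproduce for your own workload. The snippet below is a minimal sketch that plugs in the figures from this section; the function name `monthly_memory_cost` and the 30-day month are illustrative assumptions, not part of any 0Latency SDK.

```python
def monthly_memory_cost(tokens_per_query, queries_per_day, price_per_million, days=30):
    """Return the monthly input-token cost (USD) of the memories loaded per query."""
    daily_tokens = tokens_per_query * queries_per_day
    return daily_tokens * days * price_per_million / 1_000_000


# Naive top-20 retrieval: 20 memories at ~200 tokens each
naive_cost = monthly_memory_cost(20 * 200, 10_000, 2.50)

# Share of that spend going to irrelevant memories when 30-40% are relevant
waste_low = naive_cost * (100 - 40) / 100
waste_high = naive_cost * (100 - 30) / 100

print(f"Naive monthly cost: ${naive_cost:,.0f}")
print(f"Wasted per month:   ${waste_low:,.0f} to ${waste_high:,.0f}")
```

Swap in your own average memory size, query volume, and model price to see where your deployment lands.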
And it gets worse. Irrelevant memories don't just cost money — they actively degrade response quality. LLMs can get confused or distracted by extraneous context, producing longer, less focused responses. You're paying more for worse results.
The L0/L1/L2 Tiered Loading Model
0Latency organizes memory retrieval into three tiers, each with different loading behavior:
L0 — Always Loaded (Identity & Core Facts)
L0 contains memories that should be present in every single interaction. These are P0-priority memories: the user's name, core preferences, critical business rules, compliance requirements.
- Loading: Automatic, every query
- Typical size: 5-15 memories (200-600 tokens)
- Cost: Fixed baseline, always justified
- Example: "User prefers Python over JavaScript", "User is allergic to shellfish", "Company policy: never share customer PII"
L1 — Query-Relevant (Semantic + Temporal Match)
L1 memories are loaded based on the current query. They score high on both semantic similarity and temporal freshness. This is the core working set for most interactions.
- Loading: Dynamic, based on query relevance
- Typical size: 3-8 memories (300-1,600 tokens)
- Cost: Proportional to query complexity
- Example: Recent conversation context, active project details, current task state
L2 — On-Demand (Deep Context)
L2 memories are loaded only when the agent explicitly needs deeper context — multi-turn reasoning, complex questions that require historical background, or graph-expanded queries. They're never loaded by default.
- Loading: Triggered by query complexity or explicit request
- Typical size: 0-10 memories (0-2,000 tokens)
- Cost: Only incurred when needed
- Example: Historical decisions, archived project context, relationship graph context
The Math: Before vs After
Same scenario — 5,000 memories, 10,000 daily queries — but with tiered loading:
✅ Tiered Approach: L0/L1/L2 loading
- L0 (always): ~400 tokens × 10,000 queries = 4M tokens/day
- L1 (relevant): ~800 tokens × 10,000 queries = 8M tokens/day
- L2 (deep, ~15% of queries): ~1,200 tokens × 1,500 queries = 1.8M tokens/day
- Total daily: 13.8M tokens (vs 40M naive)
- Monthly cost (GPT-4o): 13.8M × 30 × $2.50/1M = $1,035/month
65% reduction in memory token costs
That's a saving of $1,965/month — or $23,580 per year — on a moderate-scale deployment. For larger deployments with millions of daily queries, the savings scale linearly.
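To see how the tier mix drives the total, here is a small sketch of the same comparison. The per-tier token averages and the 15% L2 rate come from the list above; the variable names and the dictionary layout are illustrative assumptions.

```python
PRICE_PER_MILLION = 2.50   # GPT-4o input price used in this article
QUERIES_PER_DAY = 10_000
DAYS_PER_MONTH = 30

# tier -> (average tokens per query, percent of queries that load the tier)
TIER_MIX = {"L0": (400, 100), "L1": (800, 100), "L2": (1200, 15)}


def daily_tokens(tier_mix, queries_per_day):
    """Sum the memory tokens loaded per day across all tiers."""
    return sum(tokens * pct * queries_per_day // 100 for tokens, pct in tier_mix.values())


def monthly_cost(tokens_per_day):
    """Convert a daily token volume into a monthly cost in USD."""
    return tokens_per_day * DAYS_PER_MONTH * PRICE_PER_MILLION / 1_000_000


tiered_daily = daily_tokens(TIER_MIX, QUERIES_PER_DAY)
naive_daily = 20 * 200 * QUERIES_PER_DAY  # flat top-20 retrieval

tiered_cost = monthly_cost(tiered_daily)
naive_cost = monthly_cost(naive_daily)
reduction = 1 - tiered_daily / naive_daily

print(f"Tiered: {tiered_daily:,} tokens/day, ${tiered_cost:,.0f}/month")
print(f"Naive:  {naive_daily:,} tokens/day, ${naive_cost:,.0f}/month")
print(f"Saving: ${naive_cost - tiered_cost:,.0f}/month ({reduction:.1%})")
```

The L2 rate is the most sensitive input: if far more of your queries need deep context, the saving shrinks accordingly.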
Using Context Budgets in the API
Python
import requests

API_KEY = "your-api-key"
BASE = "https://api.0latency.ai"

# Search with a context budget (in tokens)
response = requests.post(
    f"{BASE}/memories/search",
    headers={
        "X-API-Key": API_KEY,
        "Content-Type": "application/json",
    },
    json={
        "agent_id": "assistant",
        "query": "What's the status of Sarah's project?",
        "user_id": "sarah-123",
        "context_budget": {
            "max_tokens": 2000,        # Total token budget for memories
            "l0_reserve": 500,         # Reserve 500 tokens for L0 (always loaded)
            "l1_max": 1200,            # Up to 1200 tokens for L1 (query-relevant)
            "l2_enabled": True,        # Allow L2 if budget permits
            "prioritize": "relevance", # or "recency" or "balanced"
        },
    },
    timeout=10,
)
response.raise_for_status()  # Stop on HTTP errors before parsing the body
result = response.json()

print(f"L0 memories: {len(result['l0'])} ({result['l0_tokens']} tokens)")
print(f"L1 memories: {len(result['l1'])} ({result['l1_tokens']} tokens)")
print(f"L2 memories: {len(result['l2'])} ({result['l2_tokens']} tokens)")
print(f"Total tokens used: {result['total_tokens']}")
print(f"Budget remaining: {result['budget_remaining']}")


def render(memories):
    """Join memory contents, one per line."""
    return "\n".join(m["content"] for m in memories)


# Build your prompt with tiered memories
system_prompt = f"""You are a helpful assistant.
## Core Context (Always Active)
{render(result['l0'])}
## Current Context
{render(result['l1'])}
## Background Context
{render(result['l2'])}
"""
JavaScript
const API_KEY = "your-api-key";
const BASE = "https://api.0latency.ai";

const response = await fetch(`${BASE}/memories/search`, {
  method: "POST",
  headers: {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    agent_id: "assistant",
    query: "What's the status of Sarah's project?",
    user_id: "sarah-123",
    context_budget: {
      max_tokens: 2000,
      l0_reserve: 500,
      l1_max: 1200,
      l2_enabled: true,
      prioritize: "relevance"
    }
  })
});

// Stop on HTTP errors before parsing the body
if (!response.ok) {
  throw new Error(`Memory search failed with status ${response.status}`);
}

const { l0, l1, l2, total_tokens } = await response.json();

// Build tiered prompt
const systemPrompt = `You are a helpful assistant.
## Core Context (Always Active)
${l0.map(m => m.content).join("\n")}
## Current Context
${l1.map(m => m.content).join("\n")}
## Background Context
${l2.map(m => m.content).join("\n")}
`;
Smart defaults: If you don't specify a context_budget, 0Latency uses intelligent defaults: 500 tokens for L0, 1,500 for L1, and L2 disabled. For most agents, these defaults produce excellent results with zero configuration.
How Tier Assignment Works
Memories are assigned to tiers based on a combination of factors:
| Factor | L0 | L1 | L2 |
|---|---|---|---|
| Priority | P0 (Critical) | P1-P2 | Any |
| Semantic score | N/A (always loaded) | > 0.7 | 0.4 - 0.7 |
| Temporal score | N/A | > 0.3 | Any |
| Access count | Frequently accessed | Moderate | Low |
| Loading cost | Fixed (always paid) | Dynamic | On-demand only |
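The table can be read as a simple decision rule. The sketch below is one hedged interpretation, not 0Latency's actual implementation: the `Memory` class and `assign_tier` function are hypothetical, access count is ignored, and combinations the table does not cover (for example, high similarity but stale) simply return `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memory:
    content: str
    priority: int          # 0 = P0 (critical), 1 = P1, 2 = P2, ...
    semantic_score: float  # similarity to the current query, 0..1
    temporal_score: float  # freshness, 0..1


def assign_tier(memory: Memory) -> Optional[str]:
    """Map a memory to L0, L1, or L2 using the thresholds from the table."""
    if memory.priority == 0:
        return "L0"  # P0 memories are always loaded, regardless of scores
    if (
        memory.priority in (1, 2)
        and memory.semantic_score > 0.7
        and memory.temporal_score > 0.3
    ):
        return "L1"  # relevant and fresh
    if 0.4 <= memory.semantic_score <= 0.7:
        return "L2"  # on-demand candidate, any priority or freshness
    return None      # not a candidate for this query


examples = [
    Memory("Company policy: never share customer PII", 0, 0.10, 0.05),
    Memory("Sarah's project moved to QA yesterday", 1, 0.85, 0.90),
    Memory("Sarah chose Postgres over MySQL last year", 3, 0.55, 0.10),
    Memory("User's favorite pizza topping", 2, 0.20, 0.80),
]
for m in examples:
    print(assign_tier(m), "-", m.content)
```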
The system also respects token budgets strictly. If your L1 budget is 1,200 tokens and the sixth most relevant memory would push you to 1,400, it's dropped. No overflows. Predictable costs every time.
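Strict budgeting like this amounts to a greedy fill that never exceeds the cap. The sketch below is an assumption about how such a fill might work, not the service's documented behavior: `pack_within_budget` is a hypothetical helper, and it skips a memory that would overflow and keeps checking smaller, lower-ranked ones (stopping at the first overflow is an equally plausible design).

```python
def pack_within_budget(candidates, budget_tokens):
    """Greedily select memories in ranked order without exceeding the budget.

    candidates: list of (memory_id, token_count), most relevant first.
    Returns the selected ids and the number of tokens used.
    """
    selected, used = [], 0
    for memory_id, tokens in candidates:
        if used + tokens > budget_tokens:
            continue  # dropped: loading it would overflow the budget
        selected.append(memory_id)
        used += tokens
    return selected, used


# Illustrative L1 candidates; the sixth would push the total to 1,400 tokens
l1_candidates = [
    ("m1", 250), ("m2", 200), ("m3", 250), ("m4", 200),
    ("m5", 250), ("m6", 250), ("m7", 40),
]
selected, used = pack_within_budget(l1_candidates, budget_tokens=1200)
print(selected, used)
```

Either way, the invariant is the same: the tokens used never exceed the configured limit.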
Real-World Savings by Plan
Here's what typical savings look like across 0Latency plans:
| Plan | Typical Memories | Naive Token Cost/mo | With Tiered Loading | Savings |
|---|---|---|---|---|
| Free (10K memories) | 5,000 | $150 | $55 | 63% |
| Pro ($29/mo, 100K) | 50,000 | $1,500 | $520 | 65% |
| Scale ($99/mo, 1M) | 500,000 | $8,000 | $2,700 | 66% |
Based on 10,000 queries/day at GPT-4o pricing ($2.50/1M input tokens). Actual savings vary by workload and query patterns.
Beyond Token Savings: Quality Improvements
The token savings are compelling, but the quality improvements might matter more. When you load fewer, more relevant memories:
- Responses are more focused: the LLM isn't distracted by tangential context
- Hallucination risk tends to drop: less noise means fewer opportunities to confuse the model
- Latency decreases: fewer input tokens mean faster time-to-first-token
- Output tokens tend to decrease: focused context encourages concise, direct responses
In practice, tiered loading means the agent spends its token budget on the memories most likely to matter — not on a flat dump of everything it's ever stored.
Advanced: Dynamic Budget Adjustment
For sophisticated agents, you can adjust the context budget dynamically based on query complexity:
def pick_budget(query_complexity, is_multi_turn=False):
    """Choose a context budget from signals your own code supplies."""
    # Simple queries need less context
    if query_complexity == "simple":
        return {"max_tokens": 800, "l0_reserve": 400, "l2_enabled": False}
    # Complex queries get more budget
    if query_complexity == "complex":
        return {"max_tokens": 4000, "l0_reserve": 500, "l1_max": 2000, "l2_enabled": True}
    # Multi-turn conversations benefit from L2
    if is_multi_turn:
        return {"max_tokens": 3000, "l0_reserve": 500, "l1_max": 1500, "l2_enabled": True}
    # Otherwise return None and omit context_budget so the defaults apply
    return None
Combine this with temporal scoring and graph-enhanced search, and you get a memory system that's intelligent about what to remember, when it's relevant, and how much context to load.
Stop overpaying for context
Tiered memory loading ships on every plan. Start saving tokens today.
Get Your Free API Key →
Frequently Asked Questions
Do I have to use context budgets?
No. If you don't specify a context_budget, 0Latency uses sensible defaults (500 L0 tokens, 1,500 L1 tokens, L2 disabled). You can also use the simple limit parameter for flat retrieval if you prefer managing context yourself.
How does 0Latency count tokens?
Token counts use the cl100k_base tokenizer, the encoding used by GPT-4. GPT-4o and Claude models use different tokenizers, so treat the counts as close approximations for those models. Memory token counts are pre-computed at storage time, so budget calculations require no tokenization at query time. Token counts are returned in the API response for transparency.
Can I set different budgets for different agents?
Yes. You can configure default context budgets at the agent level via the agent settings endpoint. Per-query budgets override agent defaults. This lets you set conservative defaults for simple agents and generous budgets for complex reasoning agents.
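One way to picture the override order is as a layered merge. The sketch below assumes a key-by-key merge, which the documentation above does not spell out (a per-query budget might instead replace the agent default wholesale); `resolve_budget` is a hypothetical helper, and the system defaults are the ones quoted in this article.

```python
SYSTEM_DEFAULTS = {"l0_reserve": 500, "l1_max": 1500, "l2_enabled": False}


def resolve_budget(agent_defaults=None, query_budget=None):
    """Merge budgets so later layers win: system, then agent, then per-query."""
    return {**SYSTEM_DEFAULTS, **(agent_defaults or {}), **(query_budget or {})}


# A conservative agent default, overridden for one complex query
agent_defaults = {"l1_max": 800}
query_budget = {"max_tokens": 4000, "l2_enabled": True}

print(resolve_budget())
print(resolve_budget(agent_defaults))
print(resolve_budget(agent_defaults, query_budget))
```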
What if my L0 memories exceed the L0 reserve?
If your P0 memories exceed the L0 token reserve, the system loads the most recently accessed P0 memories first and logs a warning. We recommend keeping L0 lean — typically under 15 memories. If you need more, increase the l0_reserve value.
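The overflow behavior described here can be sketched as a recency-ordered fill. This is an illustration under stated assumptions rather than the service's code: the `select_l0` helper, the dictionary keys, and the token counts are all hypothetical, and memories that do not fit are skipped rather than truncated.

```python
import logging

logger = logging.getLogger("l0_loader")


def select_l0(p0_memories, l0_reserve):
    """Load P0 memories, most recently accessed first, within the L0 reserve.

    Each memory is a dict with 'content', 'tokens', and 'last_accessed'
    (any sortable timestamp).
    """
    ordered = sorted(p0_memories, key=lambda m: m["last_accessed"], reverse=True)
    loaded, used = [], 0
    for memory in ordered:
        if used + memory["tokens"] <= l0_reserve:
            loaded.append(memory)
            used += memory["tokens"]
    skipped = len(ordered) - len(loaded)
    if skipped:
        logger.warning(
            "L0 reserve of %d tokens exceeded; %d P0 memories not loaded",
            l0_reserve, skipped,
        )
    return loaded, used


# Illustrative token counts chosen to overflow a 500-token reserve
p0 = [
    {"content": "User prefers Python over JavaScript", "tokens": 180, "last_accessed": 1},
    {"content": "Company policy: never share customer PII", "tokens": 220, "last_accessed": 3},
    {"content": "User is allergic to shellfish", "tokens": 200, "last_accessed": 2},
]
loaded, used = select_l0(p0, l0_reserve=500)
print([m["content"] for m in loaded], used)
```

In this example the least recently accessed P0 memory is the one left out, which is a good reason to keep L0 small or raise `l0_reserve`.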
Does context budgeting work with graph-expanded searches?
Yes. When graph_expand is enabled alongside a context budget, graph-related memories are treated as L2 candidates. They're only loaded if the budget permits after L0 and L1 are filled. This keeps graph expansion from blowing up your token costs.
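Put differently, graph-expanded memories compete for whatever is left once L0 and L1 are filled. A minimal sketch of that remainder calculation follows; `l2_allowance` is a hypothetical helper, and it assumes the remainder is measured against tokens actually used by L0 and L1 rather than their reserved maximums.

```python
def l2_allowance(max_tokens, l0_tokens_used, l1_tokens_used, l2_enabled=True):
    """Tokens left for L2 candidates (including graph-expanded memories)."""
    if not l2_enabled:
        return 0
    return max(0, max_tokens - l0_tokens_used - l1_tokens_used)


# With the budget from the API example: 2,000 total, L0 and L1 fully used
print(l2_allowance(2000, 500, 1200))

# If L0 and L1 come in under their limits, more room is left for graph context
print(l2_allowance(2000, 380, 900))
```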