Our Recall Endpoint Was Taking 5 Seconds. Here Are the Sprints That Fixed Most of It.

We market 0Latency as a fast memory layer for AI agents. For a while, our own recall endpoint was taking 4–5 seconds per query. We knew this, we were dogfooding it daily, and it was the most frustrating thing about using the product we were building.

This is a record of the sprints it took to fix it: five shipped, one still planned. It's not a polished retrospective. It's an account of what we found, what we changed, what it cost us, and where the cold path still stands unsolved.

What We Started With

The pgvector HNSW index was fast. Similarity search against 1,979 memories was running at around 60ms — well within acceptable range. The problem wasn't the database. It was everything around it.

Layer	Latency	Status
pgvector HNSW search	~60ms	✅ Fast
Gemini embedding API (cold)	2–3s	❌ Primary bottleneck
Analytics DB retries (broken table)	~4.5s	❌ Blocking critical path
In-process LRU cache	N/A	❌ Broken — multi-worker isolation

The external embedding API call to Gemini was adding 2–3 seconds to every cold recall. Every. Single. One. The analytics layer was adding another 4.5 seconds on top of that when the analytics table was broken — which, for a period, it was. These two problems were compounding each other and neither was visible until we ran dmesg and found 11 OOM kills on the DigitalOcean droplet over the prior 10 days.

Sprint 0: Stop the Bleeding

The OOM kills were happening because we were running the API with 2 gunicorn workers and an in-process LRU embedding cache. Worker processes don't share memory: each worker had its own cache, each was duplicating the embedding model and hit counts, and under load the droplet's 4GB of RAM was exhausted. The cache that was supposed to help was making things worse.

Fix: drop to 1 worker. Immediately. The LRU cache now worked correctly because there was only one process. Cache hits dropped to 0ms. Cold recall was still at 2.7 seconds, but the OOM kills stopped and the system was stable.
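A minimal sketch of why the in-process cache only behaves with a single worker, assuming the cache is something like functools.lru_cache wrapped around the embedding call. embed_remote is a hypothetical stand-in for the Gemini request, not our actual client code.

```python
from functools import lru_cache

# Counts how many times the (hypothetical) remote embedding call really runs.
remote_calls = 0


def embed_remote(text: str) -> tuple[float, ...]:
    """Stand-in for the slow external embedding request (assumption)."""
    global remote_calls
    remote_calls += 1
    # Fake, deterministic "embedding" so the sketch runs offline.
    return tuple(float(ord(c)) for c in text[:8])


@lru_cache(maxsize=10_000)
def embed_cached(text: str) -> tuple[float, ...]:
    # The cache lives in this process's memory only. With N gunicorn
    # workers there are N independent copies, each warmed separately.
    return embed_remote(text)


embed_cached("what did the user say about pricing?")  # miss: calls embed_remote
embed_cached("what did the user say about pricing?")  # hit: served from memory
```

With two workers, the same query can miss once per worker, and every cached vector is held once per worker.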

Single-worker is not a long-term architecture. But it was the right call for that day. The latency problems we had were fixable without scaling out. Fix correctness first, then optimize.

Sprint 1: The 4.5-Second Hidden Tax

With the cache working, we started isolating where the remaining latency was coming from. What we found was embarrassing: the analytics layer had a broken table reference, and every call to is_first_api_call, is_first_memory_recalled, and check_activation_milestone was retrying 3 times on failure before continuing. Three functions, three retries each, every recall query. At ~500ms per retry, that's 4.5 seconds of latency from analytics code that was silently failing anyway.

The fix was wrapping each call in a try/except that moves on immediately on any exception: no retry, no logging on the critical path, failures invisible to the caller. Four lines of code. The latency impact was immediate and dramatic.
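One way that guard could look, as a sketch rather than our exact code. best_effort and the body of check_activation_milestone are hypothetical; the point is that a failing hook costs one attempt instead of three retries at ~500ms each.

```python
from typing import Any, Callable


def best_effort(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run an analytics hook once; swallow any failure and move on."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        # No retry and no logging here: the recall response must not wait.
        return None


def check_activation_milestone(user_id: str) -> bool:
    # Hypothetical body that simulates the broken table reference.
    raise RuntimeError("analytics table is missing")


def recall(user_id: str, query: str) -> dict:
    best_effort(check_activation_milestone, user_id)  # fails once, silently
    return {"query": query, "memories": []}  # placeholder result


result = recall("user-1", "pricing")
```

The caller gets its response whether or not the analytics hook worked, which is the whole point of the change.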

The lesson: Analytics code on the critical path of a real-time API is a ticking clock. It will fail eventually, and when it does, your users pay the retry cost. Move analytics off the critical path before you have users.

Sprint 2: Async Analytics

The try/except fix worked, but it was a band-aid. Sprint 2 built the right solution: emit an event onto Redis pub/sub immediately after the recall response is assembled, and let a separate analytics_consumer.py handle all analytics writes asynchronously. The API response returns to the caller before any analytics write occurs. Consumer failures log silently, never retry, never block.
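A sketch of the emit side, under stated assumptions: the channel name and event fields are hypothetical, and the publisher is any object with a publish(channel, message) method, which is the shape of redis-py's Redis.publish. A fake publisher stands in so the example runs without a Redis server.

```python
import json
import time
from typing import Protocol


class Publisher(Protocol):
    def publish(self, channel: str, message: str) -> int: ...


CHANNEL = "recall_events"  # hypothetical channel name


def emit_recall_event(pub: Publisher, agent_id: str, hit_count: int) -> None:
    """Fire-and-forget: publish one event, never raise into the request."""
    event = {"type": "recall", "agent_id": agent_id,
             "hit_count": hit_count, "ts": time.time()}
    try:
        pub.publish(CHANNEL, json.dumps(event))
    except Exception:
        pass  # analytics must never fail or delay a recall


class FakePublisher:
    """In-memory stand-in for a Redis client (assumption for this sketch)."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def publish(self, channel: str, message: str) -> int:
        self.sent.append((channel, message))
        return 1  # a real client returns the number of subscribers reached


pub = FakePublisher()
emit_recall_event(pub, agent_id="agent-42", hit_count=3)
```

One design consequence worth knowing: Redis pub/sub does not store messages, so an event published while no consumer is subscribed is simply dropped. That matches the "never retry, never block" rule, at the cost of occasional gaps in the analytics data.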

This also set up the event stream we'd need later for Phase 6 of the self-improving memory system — quality scoring, PostHog instrumentation, and weekly improvement reports all flow through the same pipeline.

Sprint 3 (Planned): Local Embedding Model

The cold path problem, the 2–3 second Gemini round trip for uncached queries, is still unsolved. The architecture is designed: dual embedding columns per memory. At write time, Gemini produces the embedding for maximum semantic fidelity (async, latency-insensitive). At read time, a local model (all-MiniLM-L6-v2 or gte-small) embeds the query with no external dependency.
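Since this sprint hasn't shipped, the following is only a sketch of the dual-column idea. All names here (Memory, write_memory, backfill_remote, embed_query) are hypothetical, and the toy embedders stand in for the real models. It assumes the local vector is stored alongside the Gemini one, because a query vector is only comparable with stored vectors produced by the same model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Vector = list[float]
Embedder = Callable[[str], Vector]


@dataclass
class Memory:
    text: str
    embedding_local: Vector                    # compared against queries at read time
    embedding_remote: Optional[Vector] = None  # higher-fidelity vector, filled in later


def write_memory(text: str, embed_local: Embedder, backlog: list[Memory]) -> Memory:
    """Store the memory with its local vector; defer the remote one."""
    memory = Memory(text=text, embedding_local=embed_local(text))
    backlog.append(memory)  # a background job drains this, off the request path
    return memory


def backfill_remote(backlog: list[Memory], embed_remote: Embedder) -> None:
    """Background step: latency here never reaches a caller."""
    while backlog:
        memory = backlog.pop()
        memory.embedding_remote = embed_remote(memory.text)


def embed_query(query: str, embed_local: Embedder) -> Vector:
    """Read path: local model only, so no network round trip."""
    return embed_local(query)


# Toy embedders so the sketch runs offline (not real models).
def toy_local(text: str) -> Vector:
    return [float(len(text)), 0.0]


def toy_remote(text: str) -> Vector:
    return [float(len(text)), 1.0, 2.0]


backlog: list[Memory] = []
m = write_memory("User prefers annual billing", toy_local, backlog)
query_vec = embed_query("billing preference", toy_local)
backfill_remote(backlog, toy_remote)
```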

The projected latency profile after Sprint 3:

Percentile	Current	Post-local-model
p50	2–3s	~35ms
p75	2–3s	~45ms
p99	2–3s	~80ms
p99.9	2–3s	~150ms

With 65% of agent queries requiring full vector search (agents generate non-repetitive, analytically complex queries with low cache hit rates), the local embedding model isn't a nice-to-have. It's what determines whether the product has a 35ms p50 or a 2-second one.

Sprints 4 & 5: BM25 Fast-Path

While Sprint 3 is the architecture fix, Sprints 4 and 5 added a fast path for keyword-dominant queries that bypasses the embedding step entirely. The classification logic is simple: if a query contains proper nouns, specific dates, or short exact-match terms (≤5 words, filtered for stop words), route it to a PostgreSQL full-text search using ts_rank against a pre-computed tsvector column. If BM25 returns results above a confidence threshold, return immediately. If not, fall back to vector search.
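The routing described above could be sketched like this. It is one interpretation, not our production classifier: the stop-word list, the regexes, the capitalization heuristic for proper nouns, and the min_score threshold are all assumptions, and bm25_search / vector_search are injected stand-ins for the real search functions.

```python
import re
from typing import Callable

Hit = tuple[str, float]  # (memory text, score)
SearchFn = Callable[[str], list[Hit]]

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for", "is", "was",
              "what", "when", "who", "did", "do", "about", "and", "or", "with"}
DATE_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"
    r"|\b(?:january|february|march|april|may|june|july|august|september"
    r"|october|november|december)\s+\d{4}\b",
    re.IGNORECASE,
)
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")


def is_keyword_dominant(query: str) -> bool:
    """True if the query looks like a keyword lookup rather than a semantic one."""
    tokens = TOKEN_RE.findall(query)
    terms = [t for t in tokens if t.lower() not in STOP_WORDS]
    if not terms:
        return False
    if DATE_RE.search(query):
        return True
    # Capitalized word after the first position: a crude proper-noun signal.
    if any(t[0].isupper() and t[1:].islower() for t in tokens[1:]):
        return True
    return len(terms) <= 5


def recall(query: str, bm25_search: SearchFn, vector_search: SearchFn,
           min_score: float = 0.1) -> tuple[str, list[Hit]]:
    """Try the keyword fast path first; fall back to vector search."""
    if is_keyword_dominant(query):
        hits = [h for h in bm25_search(query) if h[1] >= min_score]
        if hits:
            return "bm25", hits  # confident keyword match: skip embedding
    return "vector", vector_search(query)
```

For example, recall("Sequoia Capital notes", ...) takes the fast path when the keyword search returns a hit scoring at least min_score, and falls through to vector_search otherwise. A long, lowercase analytical question never touches the keyword path at all.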

BM25 search time: 20ms. Query classification time: 2ms. For the queries that match, total recall latency drops from 2–3 seconds to well under 100ms.

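-- The fast-path index over the pre-computed tsvector column
CREATE INDEX idx_memories_search_text_gin
  ON memory_service.memories USING gin(search_text);

-- Scoring (%s is the driver's bind parameter for the query text;
-- the @@ match in WHERE is what lets the GIN index be used)
SELECT *,
       ts_rank(search_text, plainto_tsquery('english', %s)) AS bm25_score
FROM memory_service.memories
WHERE search_text @@ plainto_tsquery('english', %s)
ORDER BY bm25_score DESC, importance DESC;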

Current pass rate is 71% on a 7-query test suite. We've identified three failure modes: date strings with spaces ("April 2026"), proper nouns with spaces ("Sequoia Capital"), and ISO date format ("2026-03-15"). All three come down to how plainto_tsquery handles spaces and punctuation in the query text. These are solvable and on the backlog.

Where We Stand

Keyword queries with BM25 hits: <100ms. Cached semantic queries: 0ms. Cold semantic queries: still 2–3 seconds, still the Gemini round trip, still unsolved until Sprint 3 ships the local embedding model.

We claim sub-100ms recall in our marketing. That's accurate for cached and BM25-routed queries. It's not accurate for cold semantic queries and we don't pretend otherwise. The full latency story is: fast when warm, fast when keyword-dominant, slow when cold and semantic. Sprint 3 eliminates the cold path. Until then, this is where we are.

The broader lesson from this sprint sequence: latency problems in AI infrastructure are rarely where you think they are. Ours weren't in the vector search. They were in analytics retries, multi-worker cache isolation, and an embedding API call we could have moved off the critical path from day one. Find the actual bottleneck before you optimize. We found ours in dmesg.
