We Built Self-Improving Memory Into Our Agent Infrastructure. Here's What Six Phases Actually Looked Like.

By the time Thomas — our Chief of Staff agent — had accumulated 1,979 memories, we had a new problem. Not too few memories. Too many, with no mechanism to clean them up. The same fact extracted from slightly different conversation angles. Decisions that had been superseded but still lived in the namespace as if they were current. The memory layer was growing, but not improving.

We'd always known consolidation would need to happen. We just hadn't built it. In early April we did — five days, six phases, and one rule that shaped every architectural decision along the way.

The Prime Directive

Before writing a single line of consolidation code, we established something we called the Prime Directive: a false negative is always better than a false positive. In memory consolidation terms — missing a merge is recoverable. Executing a bad merge corrupts the knowledge graph in ways that are hard to detect and harder to undo.

This framing turned out to matter more than any algorithm. Every threshold, every circuit breaker, every manual review gate in what followed was downstream of this one principle. When we were uncertain, we did nothing. When we had high confidence, we acted — but conservatively.

Phase 1: The Queue

Before you can consolidate memories, you need to know which ones are candidates. Phase 1 built the observation layer: a similarity scanner that ran pgvector cosine similarity across the thomas namespace, found pairs above a 0.82 threshold, and inserted them into a consolidation_queue table — without touching anything.

Read and queue only. No modifications to existing memories. This was deliberately conservative. We needed to observe the system before acting on it.

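CREATE TABLE consolidation_queue (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  agent_id TEXT NOT NULL,
  -- The two memories in the candidate pair; the queue row is removed if either memory is deleted.
  memory_id_a UUID NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
  memory_id_b UUID NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
  similarity_score FLOAT NOT NULL,        -- cosine similarity from the scanner
  consolidation_type TEXT,                -- NULL until Phase 2 labels the pair
  status TEXT DEFAULT 'pending',
  created_at TIMESTAMPTZ DEFAULT NOW(),
  processed_at TIMESTAMPTZ,
  -- Blocks exact repeats only: (a, b) and (b, a) count as different rows,
  -- so the scanner should insert each pair in a consistent order.
  UNIQUE(memory_id_a, memory_id_b)
);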

Running the scanner against Thomas's namespace for the first time surfaced 847 candidate pairs above the 0.82 threshold. That's a lot — but that's what 1,979 memories accumulated over months looks like when you actually look at them.
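The scan itself can be sketched in a few lines of Python. This is an illustration only, not the production scanner: the article's version runs the comparison inside PostgreSQL with pgvector, while this sketch compares in-memory vectors directly. The function names and the toy embeddings are hypothetical; only the 0.82 threshold comes from the text above.

```python
import math
from itertools import combinations

SIMILARITY_THRESHOLD = 0.82  # threshold quoted in the article


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar
    return dot / (norm_a * norm_b)


def find_candidate_pairs(embeddings, threshold=SIMILARITY_THRESHOLD):
    """Return (id_a, id_b, score) rows for pairs above the threshold.

    `embeddings` maps memory id -> vector. Ids are sorted so each pair is
    emitted once with id_a < id_b. Nothing is modified: rows are only
    returned, ready to be inserted into a queue.
    """
    rows = []
    for id_a, id_b in combinations(sorted(embeddings), 2):
        score = cosine_similarity(embeddings[id_a], embeddings[id_b])
        if score > threshold:
            rows.append((id_a, id_b, score))
    return rows


# Toy 2-memory-plus-outlier example (hypothetical data, illustration only).
toy = {
    "m1": [1.0, 0.0, 0.0],
    "m2": [0.9, 0.1, 0.0],
    "m3": [0.0, 1.0, 0.0],
}
candidates = find_candidate_pairs(toy)  # only the (m1, m2) pair qualifies
```

The pairwise loop is quadratic in the number of memories, which is fine for a sketch but is one reason to push the comparison into the database's vector index in practice.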

Phase 2: Classification

Not every similar pair is a duplicate. "Palmer McCutcheon is the Director of Engineering at ZeroClick" and "Palmer is connected to Ryan Hudson at ZeroClick" are semantically close but distinct facts. Phase 2 added an LLM classification pass to label each pair as DUPLICATE, RELATED, CONTRADICTION, or SUPERSEDED.

The classifier prompt took three iterations to get right. Early versions were too aggressive — labeling RELATED pairs as DUPLICATE because the surface content overlapped. The fix was adding explicit examples of what each category looks like in context, and requiring the classifier to output a confidence score and reasoning string, not just a label.
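Requiring a label, a confidence score, and a reasoning string suggests validating every classifier response before it touches the queue. A minimal sketch follows, assuming the classifier returns JSON with `label`, `confidence`, and `reasoning` keys. That response format and the names `Classification` and `parse_classification` are assumptions for illustration; the article does not specify them.

```python
import json
from dataclasses import dataclass

# The four labels named in the article.
LABELS = {"DUPLICATE", "RELATED", "CONTRADICTION", "SUPERSEDED"}


@dataclass(frozen=True)
class Classification:
    label: str
    confidence: float
    reasoning: str


def parse_classification(raw):
    """Parse one classifier response; return None if it is unusable.

    Returning None leaves the pair unclassified, which is the
    false-negative side of the Prime Directive.
    """
    try:
        data = json.loads(raw)
        label = str(data["label"]).upper()
        confidence = float(data["confidence"])
        reasoning = str(data["reasoning"]).strip()
    except (ValueError, KeyError, TypeError):
        return None  # malformed JSON or missing/invalid fields
    if label not in LABELS or not 0.0 <= confidence <= 1.0 or not reasoning:
        return None  # unknown label, out-of-range score, or empty reasoning
    return Classification(label, confidence, reasoning)
```

Rejecting anything malformed, rather than guessing a label, keeps an unreliable response from ever becoming a merge candidate.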

After classification, 102 pairs came back labeled DUPLICATE, each with a classification confidence score. The distribution mattered: most clustered above 0.90, while a few sat in the 0.70–0.85 range, where the classifier was clearly uncertain.

Phase 3: The Feedback Loop

A classifier that runs once and never improves is better than nothing. A classifier that improves from feedback is what we actually wanted to build. Phase 3 wired a recall_feedback table so that every time the recall endpoint returned results, the system could record whether those results were useful — and eventually feed that signal back into consolidation confidence scoring.

There's a cold-start constraint worth naming: until you have more than 50 feedback records, classification confidence is the only signal available. We designed the system to use feedback weighting only after that threshold is crossed, because blending in noisy early feedback degrades the classifier faster than it improves it.
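The cold-start rule can be sketched as a single function. The 50-record threshold comes from the text above; everything else is an assumption made for illustration, including the linear blend, the `FEEDBACK_WEIGHT` value, and the idea of summarizing feedback as a useful-to-total ratio. The article does not describe how feedback is actually combined with classifier confidence.

```python
MIN_FEEDBACK_RECORDS = 50  # feedback is used only once there are more than 50 records
FEEDBACK_WEIGHT = 0.3      # hypothetical blend weight, not from the article


def effective_confidence(classification_confidence, useful, total):
    """Blend classifier confidence with recall feedback after cold start.

    `useful` and `total` are counts of recall_feedback records. Until
    `total` exceeds MIN_FEEDBACK_RECORDS, the classifier's own confidence
    is returned unchanged.
    """
    if total <= MIN_FEEDBACK_RECORDS:
        return classification_confidence  # cold start: ignore noisy feedback
    usefulness = useful / total
    return ((1 - FEEDBACK_WEIGHT) * classification_confidence
            + FEEDBACK_WEIGHT * usefulness)
```

With exactly 50 records the function still returns the classifier's score untouched; the blend only starts at record 51.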

Phase 4: The Go/No-Go Gate

This is where the Prime Directive became concrete. Before any merge executes — before a single memory is modified — the system runs a P90 query against the classified queue:

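-- classification_confidence is not part of the Phase 1 table definition;
-- this query assumes the Phase 2 classifier added it to consolidation_queue.
SELECT
  consolidation_type,
  COUNT(*) AS n,
  PERCENTILE_CONT(0.9) WITHIN GROUP (
    ORDER BY classification_confidence
  ) AS p90
FROM consolidation_queue
WHERE status = 'classified'
  AND consolidation_type = 'DUPLICATE'
GROUP BY consolidation_type;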

The decision table has three tiers. P90 below 0.85: no merges execute, prompt tuning required. P90 from 0.85 up to (but not including) 0.90: maximum 5 merges per run, conservative mode. P90 at or above 0.90: up to 10 merges per run. When we ran this against Thomas's namespace for the first time, P90 came back at 0.90 exactly with 12 qualifying duplicates. The gate opened.

The floor of 0.85 isn't arbitrary. It came from manually reviewing low-confidence pairs and finding that anything below that threshold had a meaningful rate of misclassification — pairs that looked like duplicates at the embedding level but were actually distinct facts that happened to use similar phrasing.
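The gate's thresholds translate directly into a small decision function. The sketch below is illustrative: `merge_budget` and `percentile_cont` are hypothetical names, and the Python percentile helper mirrors the linear interpolation that PostgreSQL's PERCENTILE_CONT performs so the logic can be checked without a database.

```python
import math


def percentile_cont(values, fraction):
    """Linearly interpolated percentile, the method PERCENTILE_CONT uses."""
    if not values:
        return None
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


def merge_budget(p90):
    """Map the P90 of DUPLICATE confidences to a per-run merge cap."""
    if p90 is None or p90 < 0.85:
        return 0   # gate closed: tune the prompt first
    if p90 < 0.90:
        return 5   # conservative mode
    return 10      # P90 at or above 0.90
```

An empty queue yields no P90 at all, and the function treats that the same as a failing score: zero merges.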

Phase 5: The Archival Safety Loop

The merge itself is a five-step transaction. Read both source memories. Generate a consolidated version. Write the new memory. Archive both originals (with a pointer to the consolidated memory). Mark the queue entry as processed. If any step fails, the transaction rolls back. The originals stay intact.
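The five steps can be sketched as one transaction. To stay self-contained, this example uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL. The simplified tables, the `archived` and `merged_into` columns, the `'processed'` status value, and the `consolidate` callback are all assumptions for illustration, not the article's actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE memories (
    id INTEGER PRIMARY KEY,
    content TEXT NOT NULL,
    archived INTEGER NOT NULL DEFAULT 0,
    merged_into INTEGER REFERENCES memories(id)
);
CREATE TABLE consolidation_queue (
    id INTEGER PRIMARY KEY,
    memory_id_a INTEGER NOT NULL,
    memory_id_b INTEGER NOT NULL,
    status TEXT DEFAULT 'pending',
    processed_at TEXT
);
INSERT INTO memories (id, content) VALUES (1, 'fact, first wording');
INSERT INTO memories (id, content) VALUES (2, 'fact, second wording');
INSERT INTO consolidation_queue (id, memory_id_a, memory_id_b, status)
VALUES (1, 1, 2, 'classified');
"""


def make_demo_db():
    """Build a throwaway in-memory database with one queued pair."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def merge_pair(conn, queue_id, consolidate):
    """Run the five merge steps as one transaction.

    `consolidate` is a caller-supplied function (hypothetical) that turns
    two source texts into one consolidated text. Any exception rolls back
    every write, so the originals stay untouched.
    """
    with conn:  # commits on success, rolls back on any exception
        id_a, id_b = conn.execute(
            "SELECT memory_id_a, memory_id_b FROM consolidation_queue WHERE id = ?",
            (queue_id,),
        ).fetchone()
        # Step 1: read both source memories.
        read = "SELECT content FROM memories WHERE id = ?"
        text_a = conn.execute(read, (id_a,)).fetchone()[0]
        text_b = conn.execute(read, (id_b,)).fetchone()[0]
        # Step 2: generate a consolidated version.
        merged = consolidate(text_a, text_b)
        # Step 3: write the new memory.
        new_id = conn.execute(
            "INSERT INTO memories (content) VALUES (?)", (merged,)
        ).lastrowid
        # Step 4: archive both originals with a pointer to the new memory.
        conn.execute(
            "UPDATE memories SET archived = 1, merged_into = ? WHERE id IN (?, ?)",
            (new_id, id_a, id_b),
        )
        # Step 5: mark the queue entry as processed.
        conn.execute(
            "UPDATE consolidation_queue SET status = 'processed', "
            "processed_at = CURRENT_TIMESTAMP WHERE id = ?",
            (queue_id,),
        )
    return new_id
```

Calling `merge_pair(make_demo_db(), 1, lambda a, b: a)` runs all five steps against the demo data. If any step raises, including the final queue update, the new memory and the archive flags are rolled back together.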

We added a circuit breaker: if more than 3 merges in a single run produce anomalous results — content length drops more than 40% from what either source contained, or the new memory's embedding drifts more than 0.15 from both originals — the run halts. Not pauses. Halts, and requires manual review before continuing.
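A minimal sketch of the breaker logic follows, under two stated assumptions. First, "drift" is read as cosine distance (1 minus cosine similarity). Second, the length check compares against the longer of the two sources, which is the stricter reading of "either source". The names `is_anomalous`, `check_run`, and `CircuitBreakerTripped` are hypothetical; the three numeric limits come from the text above.

```python
import math

MAX_ANOMALIES = 3       # more than 3 anomalous merges halts the run
MAX_LENGTH_DROP = 0.40  # merged content may not shrink by more than 40%
MAX_DRIFT = 0.15        # assumed to be cosine distance from each original


class CircuitBreakerTripped(Exception):
    """Raised when a run must halt for manual review."""


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def is_anomalous(merged, source_a, source_b):
    """Each argument is a (text, embedding) tuple."""
    longest = max(len(source_a[0]), len(source_b[0]))
    too_short = len(merged[0]) < (1 - MAX_LENGTH_DROP) * longest
    # Drift counts only when the merged embedding is far from BOTH originals.
    drifted = (cosine_distance(merged[1], source_a[1]) > MAX_DRIFT
               and cosine_distance(merged[1], source_b[1]) > MAX_DRIFT)
    return too_short or drifted


def check_run(merge_results):
    """Return the anomaly count; halt once it exceeds MAX_ANOMALIES."""
    anomalies = 0
    for merged, source_a, source_b in merge_results:
        if is_anomalous(merged, source_a, source_b):
            anomalies += 1
            if anomalies > MAX_ANOMALIES:
                raise CircuitBreakerTripped(
                    f"{anomalies} anomalous merges; manual review required"
                )
    return anomalies
```

Raising an exception, rather than returning a flag, matches the "halts, not pauses" behavior: the run cannot continue unless something outside it deliberately restarts it.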

The first full run against Thomas's namespace executed 7 merges. All 7 passed the circuit breaker checks. None triggered anomaly detection. The consolidated memories were coherent and the archived originals remained intact and queryable if needed.

Phase 6: Weekly Self-Improvement Reports

Phases 1–5 build the consolidation engine. Phase 6 makes it self-reporting. Atlas (our data agent) runs a weekly query against the consolidation log and surfaces: how many pairs were queued, how many were classified, how many were merged, what the P90 trend looks like over time, and whether the feedback loop has accumulated enough records to start influencing confidence scoring.
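The shape of such a report can be sketched as a plain summary function. This is an illustration, not Atlas's actual query: the input format, the field names, and the assumption that queue rows move through the statuses `pending`, `classified`, and `processed` are all hypothetical. The 50-record feedback threshold is the one from Phase 3.

```python
from collections import Counter

MIN_FEEDBACK_RECORDS = 50  # cold-start threshold from Phase 3


def weekly_report(queue_statuses, weekly_p90, feedback_count):
    """Summarize one week of consolidation activity.

    queue_statuses: one status string per queue row created this week.
    weekly_p90: P90 values, oldest first, one per week observed so far.
    feedback_count: total recall_feedback records accumulated.
    """
    counts = Counter(queue_statuses)
    change = None
    if len(weekly_p90) >= 2:
        change = weekly_p90[-1] - weekly_p90[-2]  # week-over-week P90 movement
    return {
        "queued": len(queue_statuses),
        # A processed row was necessarily classified first.
        "classified": counts["classified"] + counts["processed"],
        "merged": counts["processed"],
        "p90_latest": weekly_p90[-1] if weekly_p90 else None,
        "p90_change": change,
        "feedback_ready": feedback_count > MIN_FEEDBACK_RECORDS,
    }
```

A falling `p90_change` over several weeks would be one signal that the classifier prompt needs another tuning pass.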

The goal isn't just a dashboard. It's a feedback mechanism that tells us when the classifier needs tuning, when the similarity threshold needs adjusting, and when the system is actually getting better at knowing what it knows.

What 7 Merges Taught Us

Seven merges out of 847 candidate pairs sounds conservative. It is. That's exactly what we wanted for a first run against a live namespace with 1,979 memories that we depend on daily. The P90 gate, the 5-step archival loop, the circuit breaker — all of it is designed so that the first time the system acts, it acts on only the highest-confidence cases.

As feedback accumulates and the classifier improves, the merge volume will grow. But the architecture doesn't change: observe first, classify with confidence scoring, gate on P90, archive originals rather than delete them, halt on anomalies. The Prime Directive stays constant regardless of how many merges per run we eventually allow.

Memory that improves over time isn't just a product feature. For an agent running a real business, it's the difference between a knowledge base that degrades slowly and one that gets sharper the longer it runs. We're three weeks into watching Thomas's namespace consolidate. So far it's working exactly as designed — which, in this kind of system, is the only metric that matters.
