My Architect, part 10: a repo memory for the agent — a code graph it's not allowed to trust

This is the tenth post in the series about My Architect and a continuation of the ninth — the story of the recursive-context skill that teaches the agent discipline when working with data bigger than the context window. This time: four releases in a single day (v1.14.0 → v1.16.1) that gave the agent a persistent code graph, and a controlled experiment that measured its real value instead of retelling someone else's marketing. Every number from our runs comes from the telemetry of real executions on 2026-07-03; the facts about Graphify come from its README and a live installation; estimates are flagged explicitly. Nothing is made up.

Prologue. An agent with amnesia

An LLM agent has no memory between sessions. Every new conversation about your repository starts from a blank slate: the agent greps again, reads the same files again, builds the same map in its head again — and throws it away at the end of the session. We measured this on our own repository: gathering facts about a codebase we know well cost ~77 thousand tokens and 20 tool calls — and so it goes every session, round and round.

Questions like "what breaks if I change this function's signature?" are a separate pain. The answer requires knowing every consumer of the symbol. Grep finds them, but only at the price of a full sweep, and, more to the point, the agent has nowhere to keep that knowledge once the session ends.

A whole market of "code-as-graph" tools has already grown around this pain, with slogans on the order of "90% token savings" and "an agent 500× smarter." We didn't need slogans; we needed answers to two questions: which classes of errors this removes, and where the savings are real versus where they are hype. Spoiler: the table of measured numbers is in chapter 5.

Chapter 1. Don't build your own: Graphify

A tool for persistent code memory already exists — Graphify (open source, MIT, ~77k stars): it parses code with tree-sitter, fully locally (36 languages, no API calls, no telemetry), stores a graph of symbols and relationships in graphify-out/graph.json, and drops an interactive HTML map and a ready-made text report about the repository alongside it. Index freshness is maintained by git hooks; for team work there's a merge-driver so the graph can be committed without conflicts.

We deliberately didn't build our own graph engine. Instead we taught our decomposition-discipline skill to recognize an index sitting in the repository and use it correctly. The whole feature is a markdown reference with rules plus a few pointer lines; the plugin gained no hard dependencies. With no graph, everything works as before. With no graph and a heavy task, the agent offers the owner an installation once and performs it itself after a "yes"; silent installs are forbidden.

How cheap this is in practice: on our repository (351 files) the graph build took seconds and produced 2655 nodes, 4948 edges and 153 automatically discovered clusters — 6.1 MB on disk, zero API cost. One query afterwards looks like this:

```
$ graphify explain "getBlockers"
Node: getBlockers()   Source: src/hierarchy/model.ts L419   Degree: 9
  <-- NodePropertiesPanel.tsx [imports]
  <-- validator.ts [imports]
  <-- validateHierarchy() [calls]
  ...
```

One second — and the agent has every consumer of the function, with paths. What used to require a full sweep of the repository became a query. A detail that matters for the entire economics discussion ahead: the graph itself spends no tokens at all — both the build and the queries are local utilities, not model calls.
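If the agent wants those consumers as structured candidates rather than text, the output shown above can be parsed in a few lines. This is a minimal sketch that assumes the layout is exactly as in the sample (a `Node: ... Source: ... Degree: ...` header followed by `<-- name [relation]` lines). The names `parse_explain` and `Explained` are ours for illustration and are not part of Graphify, and the real output may differ between versions.

```python
import re
from dataclasses import dataclass, field

# Assumption: header and edge lines look exactly like the sample in the post.
HEADER_RE = re.compile(
    r"Node:\s*(?P<node>\S+)\s+Source:\s*(?P<path>\S+)\s+L(?P<line>\d+)"
    r"\s+Degree:\s*(?P<degree>\d+)"
)
EDGE_RE = re.compile(r"^\s*<--\s*(?P<name>.+?)\s*\[(?P<relation>\w+)\]\s*$")


@dataclass
class Explained:
    node: str
    path: str
    line: int
    degree: int
    consumers: list = field(default_factory=list)  # (name, relation) pairs


def parse_explain(text: str) -> Explained:
    """Turn the explain text into candidates (not yet facts)."""
    result = None
    for raw in text.splitlines():
        header = HEADER_RE.search(raw)
        if header:
            result = Explained(header["node"], header["path"],
                               int(header["line"]), int(header["degree"]))
            continue
        edge = EDGE_RE.match(raw)
        if edge and result is not None:
            result.consumers.append((edge["name"], edge["relation"]))
    if result is None:
        raise ValueError("no 'Node:' header found")
    return result


SAMPLE = """\
Node: getBlockers()   Source: src/hierarchy/model.ts L419   Degree: 9
  <-- NodePropertiesPanel.tsx [imports]
  <-- validator.ts [imports]
  <-- validateHierarchy() [calls]
  ...
"""

info = parse_explain(SAMPLE)
print(info.path, info.line, len(info.consumers))
```

The sample is truncated with `...`, so the parser sees three consumers even though the header reports a degree of 9; everything it returns is still only a list of places to look.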

Chapter 2. The rule the whole thing was built for

An index is a cache, and a cache knows how to lie: it drifts behind the code between rebuilds. That's why the center of the feature isn't commands but three rules:

1. The graph is navigation and candidates, NOT a source of facts. A graph answer says WHERE to look; only what a live file confirms becomes a fact. The fact contract ({claim, evidence_path, confidence}) didn't change one iota.
2. The freshness check is mandatory. The agent's first move is to compare the index build time with the time of the last commit. If the index is stale, the agent says so out loud, offers the owner a rebuild (and, if no git hook is installed, offers the hook once too, so the staleness doesn't recur every session), and labels candidates "from a stale index."
3. Verify frugally, by regions. A candidate arrives with path:line, and verification reads ±30 lines around it, not the whole file. Without this rule, verification eats everything the graph saved on search.
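The first and third rules can be sketched together: read only the region around a candidate and promote it to a fact only when the live file confirms it. This is an illustration under stated assumptions, not the skill's implementation (the skill is markdown text, not code). The helper names, the substring match used as "confirmation", and the `"high"` confidence value are all hypothetical.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Fact:
    claim: str
    evidence_path: str  # "path:line" inside a live file
    confidence: str     # hypothetical value, e.g. "high"


def read_region(path: str, line: int, radius: int = 30) -> list[tuple[int, str]]:
    """Return (line_number, text) pairs within +/- radius of a 1-based line."""
    lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    first = max(1, line - radius)
    last = min(len(lines), line + radius)
    return [(n, lines[n - 1]) for n in range(first, last + 1)]


def confirm(claim: str, path: str, line: int, symbol: str) -> Optional[Fact]:
    """Promote a graph candidate to a fact only if the live file shows the symbol."""
    if not Path(path).is_file():
        return None  # the index points at a file that no longer exists
    for number, text in read_region(path, line):
        if symbol in text:  # crude stand-in for the agent actually reading the region
            return Fact(claim, f"{path}:{number}", "high")
    return None  # unconfirmed candidates never become facts


# Demo on a throwaway file: the index says line 48, the code has moved to line 50.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "model.ts"
    body = [f"// filler {n}" for n in range(1, 101)]
    body[49] = "export function getBlockers() {}"
    demo.write_text("\n".join(body) + "\n", encoding="utf-8")

    region = read_region(str(demo), 48)
    fact = confirm("getBlockers is defined in model.ts", str(demo), 48, "getBlockers")
    rejected = confirm("model.ts calls removedHelper", str(demo), 48, "removedHelper")

print(len(region), fact.evidence_path.rsplit(":", 1)[1], rejected)
```

The evidence path records the line where the symbol was actually found in the live file, not the line the index reported, so a slightly drifted index still yields a correct fact.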

Chapter 3. How we tested it

A skill is text that has to change the model's behavior. You can't compile text or cover it with a unit test, so the verification is layered, and each layer catches its own class of errors. The iron law of every release: tests are written before the edits, and "how the agent fails without them" gets recorded first.

Layer 1 — triggers: will the skill load on its own. The judges see only the skill's name and description (exactly what Claude Code sees when choosing) plus the request; three independent judges per case; the dataset holds positives in two languages, treacherous near-misses ("rename a variable" — looks like an impact question, but the skill must stay silent) and a cross-check that the new skill doesn't steal the neighboring skill's triggers. Final run: 30/30 votes exactly per the answer key.
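The scoring behind a figure like "30/30 votes" is a plain tally of judge votes against the answer key. A minimal sketch follows; the data shape and the votes are hypothetical, and only the two request texts are taken from this post.

```python
def score_trigger_run(cases):
    """Count judge votes that agree with the answer key.

    cases: list of (request, should_trigger, votes), one bool per judge.
    """
    agreed = total = 0
    for _request, should_trigger, votes in cases:
        agreed += sum(vote == should_trigger for vote in votes)
        total += len(votes)
    return agreed, total


# Hypothetical votes for one positive and one near-miss case.
cases = [
    ("what breaks if I change X?", True, [True, True, True]),
    ("rename a variable", False, [False, False, False]),
]
agreed, total = score_trigger_run(cases)
print(f"{agreed}/{total} votes match the answer key")
```

Counting votes rather than cases keeps a split decision visible: a case that passes two to one still shows up as a lost vote.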

Layer 2 — behavior: is the discipline followed. An executor reads the skill from disk and writes a plan for the scenario; an independent grader checks it line by line against pre-recorded assertions. Each scenario hits a specific way to break: fresh graph → queries before grep, but facts from files; stale → notice it and don't trust it; no graph → offer once and don't install silently.

Layer 3 — a hostile review of the text itself. A separate reviewer doesn't read the skill politely — it executes it: runs the shell snippets on boundary inputs, spins up docker, checks every command name against the upstream README.

Layer 4 — live runs: a real installation (with consent), a real graph, a real agent on a real repository.

Layer 5 — the controlled experiment (chapter 5): once the qualitative checks pass, the remaining question is "so how much does it cost?" — and only an A/B with telemetry answers that.

Chapter 4. What we caught before the experiment. Four stories

The conditional-routing trap. We added the branch "no graph → offer the owner a build" — and review showed it was unreachable: every pointer sending the agent into the graph reference began with "if the graph EXISTS." An agent without a graph would simply never open the file containing the offer. A classic mistake: the new branch exists, but every road to it is gated on the state where it isn't needed.

A bug that would have lived only on Linux. The freshness-check snippet used macOS stat syntax with a fallback to the GNU variant. The reviewer spun up docker and showed that with GNU stat the same flag means something else — the command "successfully" prints garbage, and the gate answers STALE forever. On any Linux machine graph-first would never have switched on, silently.
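One way to sidestep the `stat` dialect problem entirely is to avoid `stat` and let a single runtime read both timestamps. The sketch below is not the snippet that ships in the skill. It assumes the index lives at graphify-out/graph.json (as described in chapter 1) and that the last commit time is what `git log -1 --format=%ct` prints; the function names and the `UNKNOWN` state are our own.

```python
import os
import subprocess
import tempfile


def index_is_stale(index_mtime: float, last_commit_ts: float) -> bool:
    """Stale means the newest commit is younger than the index build."""
    return last_commit_ts > index_mtime


def check_freshness(repo: str, index_rel: str = "graphify-out/graph.json") -> str:
    """Return NO_GRAPH, FRESH, STALE, or UNKNOWN for the given repository."""
    index_path = os.path.join(repo, index_rel)
    if not os.path.isfile(index_path):
        return "NO_GRAPH"
    try:
        out = subprocess.run(
            ["git", "-C", repo, "log", "-1", "--format=%ct"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        last_commit = float(out)  # Unix timestamp of the last commit
    except (OSError, subprocess.CalledProcessError, ValueError):
        # A broken check is reported as such, never silently mapped to STALE.
        return "UNKNOWN"
    if index_is_stale(os.path.getmtime(index_path), last_commit):
        return "STALE"
    return "FRESH"


with tempfile.TemporaryDirectory() as tmp:
    print(check_freshness(tmp))  # NO_GRAPH: nothing has been built there
```

Keeping `UNKNOWN` separate from `STALE` is the design choice that matters here: the Linux bug was a failed measurement masquerading as a stale index. One caveat applies to any mtime-based gate: the mtime records when the file was written on this machine, so a committed graph in a fresh clone will look newer than it really is.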

Upstream knows better. We advised hiding graphify-out/ in .gitignore — while Graphify's README recommends teams commit the graph (a shared map; conflicts are handled by their merge-driver). Our advice was silently throwing away the team scenario; now the offer honestly presents the choice. Along the way it turned out the PyPI package is called graphifyy — with two "y"s.

The executor that didn't believe the task statement. A behavioral-test scenario asserted "the graph is fresh" — the agent actually checked the mtime, discovered that the graph in its real environment had gone stale (our own commits had overtaken it), and handled the discrepancy per the recipe. The "don't trust, verify" discipline worked even against the text of its own assignment.

In a separate pass we ran an audit of "what the agent would never figure out on its own" — and closed three gaps: impact questions got their own entry point into the skill (before, "what breaks if I change X?" outside a tracked task didn't lead to the graph at all), the ready-made GRAPH_REPORT.md report is now read first during an audit (the summary is already computed — collecting it via queries would be silly), and with a stale graph and no hook the agent offers the hook.

Chapter 5. The experiment: measure ourselves, don't quote marketing

Design. A large repository unfamiliar to both arms: NestJS, 2125 files (about six times the size of ours). Two identical clones: arm A with a graph (12309 nodes, 22895 edges, 738 clusters; built by tree-sitter in seconds, $0), arm B without. Three classes of questions; prompts identical up to the path, with not a word about the graph (arm A must find it by itself via the skill's discipline); one model (Sonnet); one output contract (≥8 facts with paths + honest coverage + a log of every call). Metrics: tokens and calls from telemetry. Quality: fact count and spot-checked path verification.

Results:

| Question class | A (graph) | B (no graph) | Delta |
| --- | --- | --- | --- |
| Impact: "consumers of ModuleRef, what breaks if get() changes" | 71.4k tokens · 26 calls · solo · 12 facts | 71.9k tokens · 31 calls · solo · 14 facts | parity (A: −16% calls and time) |
| Understanding: "HTTP-request lifecycle end-to-end" | 75.9k tokens · solo · 13 facts | 56.1k top-line + 205.8k in hidden sub-agents (measured) ≈ 262k tokens · 21 facts | graph ~3.5× cheaper (−71%) |
| Map: "packages/core: modules, API, risks" | ≈445k tokens (8 sub-agents, estimate*) · 24 facts | ≈500k tokens (9 sub-agents, estimate*) · 25 facts | ~parity |

\* The nested-cost estimate is calibrated on the single precisely measured cell (Q2-B: 205,759 tokens across 5 agents) and flagged as an estimate; the direction of the error is unknown.

Finding 1: the savings are real, but selective. On the "understand a subsystem" class — a measured −71%: not because graph queries are cheaper than grep, but because the graph narrowed the corpus right away, and a single agent coped where arm B was forced to silently unfold five. On the impact class the savings are zero (grep on a literal symbol name is cheap even across 2125 files; the graph's win there is completeness and speed, not tokens). On the full audit — zero (both need orchestration; the graph provides ready-made partitioning and candidates, not a discount). There is no universal "90%" multiplier — there is a class of tasks.

Finding 2, methodological — hidden orchestration. By the top-line numbers, arm B looked cheaper on "understanding" (56k versus 76k) — until we cracked open its transcripts and found 205.8k tokens of nested agents. Pretty "agent with the tool versus agent without" comparisons systematically undercount the arm that quietly delegates. An honest measurement must count the entire agent tree — we suspect a fair share of the popular numbers doesn't bother.
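Counting the entire agent tree is a short recursion once the telemetry is in a tree shape. In the sketch below the `AgentRun` structure is hypothetical; the token figures are the ones reported above (75.9k and 56.1k are the rounded top-line numbers, 205,759 is the measured nested total). Because only the aggregate across the five sub-agents is published, they are modelled as a single child node.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    name: str
    own_tokens: int
    children: list["AgentRun"] = field(default_factory=list)


def tree_tokens(run: AgentRun) -> int:
    """Tokens spent by this agent plus everything it delegated."""
    return run.own_tokens + sum(tree_tokens(child) for child in run.children)


# "Understanding" question, figures from the results table.
arm_a = AgentRun("A: top-level agent, with graph", 75_900)
arm_b = AgentRun(
    "B: top-level agent, no graph",
    56_100,
    [AgentRun("5 hidden sub-agents (aggregate)", 205_759)],
)

top_line_says_b_cheaper = arm_b.own_tokens < arm_a.own_tokens
saving = 1 - tree_tokens(arm_a) / tree_tokens(arm_b)

print(top_line_says_b_cheaper, tree_tokens(arm_b), round(saving * 100))
```

The same two arms give opposite verdicts depending on what is summed: the top line favours arm B, while the full tree shows arm A spending roughly 71% less.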

The discipline of distrusting the graph held up inside the experiment too: on the impact question, arm A filtered out the graph's false-positive candidates (barrel imports) with a live grep, and during the audit it verified the graph's cycles and discarded two false ones. The graph provided leads — only what was confirmed became a fact.

A bonus nobody ordered: arm B's audit agents found two plausible real bugs in the NestJS core (an assignment into a Map via index that is never read; a return instead of continue silently truncating middleware registration) — an unplanned confirmation that the discipline's audit depth is genuine.

Limitations — honestly: n=1 per cell (no variance), one repository, one model tier; the nested cost of two cells is estimated by calibration against the single precisely measured one (and flagged as an estimate); the arms chose their own strategy — that's a design feature (we measure the system as a whole), but it conflates "the graph" with "the decision not to orchestrate"; the long-term amortization across repeated sessions wasn't measured by this experiment — it remains a hypothesis with a strong prior (~77k of "re-learning tax" per session against a one-time free build).

Chapter 6. Before → after

| Area | Before | After |
| --- | --- | --- |
| Repo knowledge between sessions | doesn't survive the session; ~77k tokens to re-learn | the graph builds in seconds, once; candidates are instant free queries |
| "Who calls X / what breaks?" | a full grep sweep; outside a tracked task the agent didn't know about the graph | a standalone entry point: `graphify affected "X"` → verify by ±30-line regions, including indirect relationships grep can't see |
| Understanding a subsystem on a large repo | a hidden fan-out of sub-agents, ~262k tokens | one agent with the graph, ~76k tokens (−71%, measured) |
| Repository audit | reconnaissance from scratch | `GRAPH_REPORT.md` (ready-made hubs, clusters, unexpected relationships) is read first |
| Trust in the index | none | freshness gate; stale → say so out loud + offer a rebuild and a hook; facts only from live files |
| Installation | none | on the owner's explicit "yes": the agent offers once, installs and builds itself; a refusal is respected for the whole session |

Epilogue

The feature's formula: an existing tool (Graphify) + a discipline for handling it (our skill) + layered tests that trust nobody — not the text's author, not a pretty snippet, not the premise of their own scenario, and, as chapter 5 showed, not the top line of someone else's benchmark. Of all the substantial findings across the four releases, not one was found by "proofreading" — every one came from execution: judges, graders, docker, live runs and the experiment's telemetry.

Deliberately deferred: pull-request triage via the graph (we have no PR process — we don't build a tool without a consumer), the save-result/reflect query-memory loop (a separate conversation), and a repeat of the experiment with n>1 and a second repository — if the selective savings hold up there too, the "90% hype" will get an honest replacement: "−70% on the 'understanding' class, zero on the rest."

How to try it

The code graph ships in the my-architect plugin starting from v1.14.0 (current: v1.16.1). There is no need to call it by hand: the skill recognizes a graph in the repository on its own, and on a heavy task without one it offers to install it.

Create an account at my-architect.app, grab a token on the API Keys page, and export it in the same session where you run Claude Code:

```bash
export MCP_API_KEY=mcp_YOUR_TOKEN
```

Add the marketplace and install the plugin:

```
/plugin marketplace add d7561985/my-architect-marketplace
/plugin install my-architect@my-architect-marketplace
```

You can also build the graph by hand, without waiting for the agent's offer:

```bash
pip install graphifyy && graphify build .
```

Already have the plugin? /plugin marketplace update my-architect-marketplace → /plugin update my-architect.

---

Facts and links

The skill reference: plugins/my-architect/skills/recursive-context/references/code-graph.md in the open repository github.com/d7561985/my-architect-marketplace; releases v1.14.0 (f2ca07d), v1.15.0 (1257064), v1.16.0 (ee0a711), v1.16.1 (6f35623), all on 2026-07-03.
Full test and experiment data: skills/recursive-context/evals/RESULTS.md (RED/GREEN/LIVE/A/B); datasets trigger-evals.json (13 cases), behavior-evals.json (10 cases).
Graphify: github.com/safishamsi/graphify (MIT; PyPI package graphifyy); our installation: 0.9.5.
The previous part of the series: My Architect, part 9 — about the decomposition-discipline skill itself.