My Architect, part 9: Recursive context — teaching the agent to admit what it hasn't read

This is the ninth post in the series about My Architect. For anyone joining now, in short: My Architect is a system where the project lives between an AI agent's sessions; the agent runs it through MCP, the human watches progress in a visual interface. This post is the story of one feature: from an MIT paper to the recursive-context skill in the my-architect plugin v1.13.0. Every number from our runs comes from real executions on 2026-07-03; the RLM research numbers come from the paper arXiv:2512.24601. Nothing is made up.

Prologue. The problem a big context window doesn't solve

Modern agents have enormous context windows, and that breeds a dangerous illusion: "if it fits, the agent can handle it." In practice, working with big data breaks in three ways, and none of them is cured by window size.

First: the window is a consumable. An agent that "just read" a 200 MB log either didn't read it (the window ran out) or burned the session's budget on a single file — and works with a stump of context from there on.

Second: plausibility instead of facts. Ask an agent to "collect what already works in the code" — and you'll get smooth prose in which facts from the code, retellings of the docs and guesses are indistinguishable. A big context gives no guarantees about working with data: the model knows how to sound confident without leaning on any source.

Third: silent narrowing of coverage. The agent processed 40 files out of 60, something crashed along the way — yet the answer still says "I analyzed the repository." Nobody lied on purpose; it's just that nobody ever promised to count what was skipped.

Before this feature we had no standing discipline for these cases: the agent's behavior was whatever the particular session felt like, nobody promised to count coverage, and facts in answers were indistinguishable from retold documentation.

Chapter 1. The idea: the RLM paper and why we didn't need their library

In December 2025 a group at MIT (OASYS lab) published Recursive Language Models with an open-source repository, alexzhang13/rlm. The idea is simple and beautiful: instead of stuffing a gigantic input into the prompt, rlm.completion() places it as a variable in a REPL — and the model writes code that inspects that input, slices it, and recursively calls a sub-model (rlm_query, with batching and a cap on parallel calls) over the pieces. Context stops being prompt text and becomes an external environment.

Numbers from the paper: inputs up to 100× longer than the native window; on long-context benchmarks a median lead over vanilla GPT-5 of +26–130% depending on the baseline strategy (and ~+13% against Claude Code's own baseline), at comparable cost. An important caveat from the authors: recursive decomposition wins on naturally decomposable tasks (code analysis, document processing) and hurts holistic sequential reasoning. The repository offers REPL environments from a local process to Docker and cloud sandboxes, and notes adoption in DSPy, Symbolica and Google Cloud.

And here we spotted the key thing: Claude Code already has every RLM primitive, natively. Bash is the REPL for deterministic pre-slicing. A Workflow script is a plain JavaScript scenario from which the agent orchestrates sub-agents: agent() is literally "call a sub-LLM as a function," isolated from the session history; pipeline() and parallel() are batching with an automatic parallelism cap; budget is the spending ceiling. Dragging a Python dependency with a second layer of recursion into a harness that already does all this is overkill. So what we needed wasn't a package but a skill: a document that pins down the discipline.

Chapter 2. What we built

The recursive-context skill in the my-architect plugin: a thin router + three recipes + one canonical workflow script. From the start we scoped the design wider than "one huge file": the same principles cover a codebase audit and mining facts from a repository for requirements — a repo audit is exactly the big-corpus task the paper calls ideal for recursion.

The core discipline in one paragraph: a size gate before reading (ls/wc as the very first action; at ≤256 KB and ≤5000 lines the file is read normally and the skill stays silent) → index, don't ingest (chunk boundaries are cut by code in the scratchpad, so content doesn't flow through the window; needle-style tasks start with a coarse grep) → a fan-out of isolated sub-agents, each getting only its chunk plus a narrow question and returning strictly by schema, not in prose → recursion as a loop (reduce rounds fold the aggregate until it is small, all within a single script) → synthesis with honest coverage (how many chunks were processed, which ones failed). Facts from code get a separate contract, {claim, evidence_path, confidence}, with a hard rule: a claim without a path into the code is not a fact. It goes into the document as a placeholder [fact: <question>], not as a plausible stub.
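The size gate at the front of that chain fits in a few lines of JavaScript. This is a minimal illustration, not the skill's actual code: the name `sizeGate` and its return values are hypothetical, and 256 KB is assumed to mean 256 × 1024 bytes. Only the two thresholds come from the text above.

```javascript
// Hypothetical sketch of the size gate. The thresholds come from the post;
// the function name, return values and the 1 KB = 1024 bytes reading are assumptions.
const MAX_BYTES = 256 * 1024;
const MAX_LINES = 5000;

function sizeGate({ bytes, lines }) {
  // Both limits must hold for a plain read; exceeding either one fires the gate.
  const fits = bytes <= MAX_BYTES && lines <= MAX_LINES;
  return fits ? "read-normally" : "index-and-fan-out";
}

// The 48.7 MB log from the live run described in this post.
console.log(sizeGate({ bytes: 48_735_680, lines: 600_000 }));
// A small log; the byte size here is made up for illustration.
console.log(sizeGate({ bytes: 9_000, lines: 180 }));
```

The inputs are exactly what `wc -lc` reports, so the gate costs one cheap command and no reading.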

What it looks like live

The first seconds of work are not reading but reconnaissance by code. Real commands from a live run on a 48.7-megabyte log; the main agent's entire "contact" with the file's contents was six probe lines:

wc -lc fixture.log        # 600000 lines, 48,735,680 bytes → the gate fired
head -4 fixture.log       # what shape are the records? (4 lines)
tail -2 fixture.log       # (2 more)
split -l 10000 -a 3 fixture.log chunks/chunk-   # 60 chunks of ~793 KB each: code does the cutting, not reading
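Because `split` cuts by line count and names chunks deterministically, the orchestrator can map any chunk name back to a line range without opening the file. The sketch below illustrates that bookkeeping. `chunkSuffix` and `chunkRange` are hypothetical helper names, and the code assumes the default lowercase alphabetic suffixes that `split -a 3` produces.

```javascript
// Hypothetical helpers: reproduce split's three-letter suffixes (aaa, aab, ...)
// and the 1-based line range that each chunk covers.
function chunkSuffix(index, width = 3) {
  let n = index;
  let suffix = "";
  for (let i = 0; i < width; i++) {
    suffix = String.fromCharCode(97 + (n % 26)) + suffix; // 97 is "a"
    n = Math.floor(n / 26);
  }
  return suffix;
}

function chunkRange(index, linesPerChunk, totalLines) {
  const first = index * linesPerChunk + 1;
  const last = Math.min((index + 1) * linesPerChunk, totalLines);
  return { name: `chunk-${chunkSuffix(index)}`, first, last };
}

const totalChunks = Math.ceil(600_000 / 10_000);
console.log(totalChunks); // 60
console.log(chunkRange(19, 10_000, 600_000)); // chunk-aat, lines 190001 to 200000
```

With this mapping, a finding that cites a chunk can be traced back to an absolute position in the original log.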

Then each sub-agent gets an isolated assignment — only its own chunk and a narrow question. A fragment of the actual prompt from that run:

Analyze ONLY the file …/chunk-aat: it's a chunk of a production log (~10000 lines). Work with grep/awk/Read strictly inside this file, don't go anywhere else. The log's background noise looks like: "<timestamp> DEBUG|INFO|WARN [auth|billing|catalog|gateway|search] request rid=N handled in Xms status=200". Question: find ALL lines that do NOT fit this noise template […] If there are no anomalies, return an empty findings.

The sub-agent must answer not in prose but with structure. Here's a finding from that run — verbatim, as it came back (the schema forces an exact quote as evidence):

{
  "claim": "ERROR-level log line from an unexpected component (payment-reconciler)
            reporting a panic/ledger drift, not matching the standard noise template",
  "evidence": "2026-06-25T00:00:00 ERROR [payment-reconciler] PANIC: ledger drift detected txn=TXN-2222",
  "chunk": "…/chunks/chunk-aat"
}
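The demand for an exact quote makes findings mechanically checkable: a finding whose evidence does not occur verbatim in its chunk can be rejected without a second model call. A minimal sketch of such a check follows. `checkFinding` is a hypothetical helper, not part of the skill, and the sample chunk text is made up around the one evidence line quoted above.

```javascript
// Hypothetical validator: returns a list of problems, empty when the finding passes.
function checkFinding(finding, chunkText) {
  const problems = [];
  for (const key of ["claim", "evidence", "chunk"]) {
    if (typeof finding[key] !== "string" || finding[key].trim() === "") {
      problems.push(`missing field: ${key}`);
    }
  }
  // The evidence must equal a whole log line exactly, not merely resemble one.
  const lines = chunkText.split("\n");
  if (typeof finding.evidence === "string" && !lines.includes(finding.evidence)) {
    problems.push("evidence is not a verbatim line of the chunk");
  }
  return problems;
}

// Made-up chunk content: one noise line in the template's shape plus the quoted anomaly.
const sampleChunk = [
  "2026-06-25T00:00:00 INFO [auth] request rid=1 handled in 12ms status=200",
  "2026-06-25T00:00:00 ERROR [payment-reconciler] PANIC: ledger drift detected txn=TXN-2222",
].join("\n");

const goodFinding = {
  claim: "ERROR-level line from an unexpected component",
  evidence: "2026-06-25T00:00:00 ERROR [payment-reconciler] PANIC: ledger drift detected txn=TXN-2222",
  chunk: "…/chunks/chunk-aat",
};

console.log(checkFinding(goodFinding, sampleChunk));
console.log(checkFinding({ ...goodFinding, evidence: "a panic about ledger drift" }, sampleChunk));
```

The second call shows a paraphrase being caught, which is exactly the failure that prose answers hide.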

The orchestration is plain JavaScript that the agent hands to the Workflow tool. The heart of the script (abridged; the full canon lives in the skill):

const rawPartials = await pipeline(groups, (g, _o, i) =>
  agent(`Read ONLY these files: ${g.join(', ')}. Question: ${A.question}. …`,
        { label: `map:${i}`, schema: FINDING }))
const partials = rawPartials.filter(Boolean)                     // null = a failed group
const failedGroups = rawPartials.map((r, i) => (r ? -1 : i)).filter(i => i >= 0)

let prevSize = Infinity                                          // recursion = a loop with a safety net
while (JSON.stringify(working).length > 30_000) {
  const size = JSON.stringify(working).length
  if (size >= prevSize) { log(`no progress — stop`); break }     // termination guarantee
  prevSize = size
  /* …reduce agents fold the findings in batches… */
}
return { findings: working, groupsSucceeded: partials.length, failedGroups }  // coverage, honestly

And this is what a "fact" looks like in the requirements task — a verbatim element from a live run over our own repository (note: a path and line numbers, not "somewhere in the code"):

{
  "claim": "updateNode(...) gates: if updates.status === 'done' && existing.status !== 'done',
            it computes getBlockers(...) and throws HierarchyValidationError(..., 'BLOCKED_BY_OPEN_ITEMS')",
  "evidence_path": "src/server/domain/hierarchy.ts:189-215",
  "confidence": "verified"
}
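The hard rule of the facts contract is simple enough to enforce in code. The sketch below is one hypothetical way to do it: `parseEvidencePath` and `renderFact` are illustrative names, and the `path:start-end` shape is inferred from the single example above rather than from a published schema.

```javascript
// Hypothetical parser for "file:line" or "file:start-end" evidence paths.
function parseEvidencePath(evidencePath) {
  const m = /^(.+?):(\d+)(?:-(\d+))?$/.exec(evidencePath ?? "");
  if (!m) return null;
  const start = Number(m[2]);
  const end = m[3] === undefined ? start : Number(m[3]);
  return end >= start ? { file: m[1], start, end } : null;
}

// A claim without a usable path is not rendered as a fact:
// the document gets an explicit placeholder instead.
function renderFact(fact, question) {
  const where = parseEvidencePath(fact && fact.evidence_path);
  if (!fact || !fact.claim || !where) return `[fact: ${question}]`;
  return `${fact.claim} (${fact.evidence_path}, ${fact.confidence})`;
}

console.log(parseEvidencePath("src/server/domain/hierarchy.ts:189-215"));
console.log(renderFact({ claim: "Closing a node is gated on open blockers" }, "what guards the done status?"));
```

The second call has a claim but no path, so the output is the placeholder, not the claim.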

Chapter 3. Why the tests are shaped this way

We followed the "iron law" of skill development: eval datasets are written before the skill files, the baseline is measured before authoring. If you haven't seen how the agent fails without the skill, you don't know what the skill teaches. The tests settled into three layers, each answering its own question.

Trigger tests: "will the skill load by itself?" From the start I set a requirement: the agent must automatically understand when to pick up the skill — with no hint from the human. In Claude Code the loading decision is made from the name and description — so the judges in the test see only name+description and the user's request, nothing else. Three independent judges per case; the dataset holds not only positives (a 200 MB log, a dump, a transcript, a repo audit — in Russian and English) but also near-miss negatives: a 300-line YAML, editing three known files, 40 same-shaped tickets from a CSV. Plus a cross-regression: 10 positives of the neighboring myarchitect skill — the new skill must not steal someone else's triggers. A skill that fires on everything is worse than no skill at all.
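To make the vote counting concrete, here is a small sketch of how such a tally could work. It is an assumption about the bookkeeping, not the eval harness itself: `tallyVotes` and the case shape are hypothetical, and the sample data is made up (including one deliberately wrong vote to show how a false fire would surface).

```javascript
// Hypothetical tally: each case carries the expected decision and one boolean
// vote per judge (true means "load the skill").
function tallyVotes(cases) {
  let agree = 0;
  let total = 0;
  let falseFires = 0;
  let misses = 0;
  for (const { expectFire, votes } of cases) {
    for (const vote of votes) {
      total++;
      if (vote === expectFire) agree++;
      else if (vote) falseFires++; // fired on a case that should stay silent
      else misses++;               // stayed silent on a case that should fire
    }
  }
  return { agree, total, falseFires, misses };
}

// Made-up sample, not the real dataset.
const sampleCases = [
  { id: "200 MB log", expectFire: true, votes: [true, true, true] },
  { id: "300-line YAML", expectFire: false, votes: [false, false, false] },
  { id: "three known files", expectFire: false, votes: [false, true, false] },
];

console.log(tallyVotes(sampleCases)); // agree 8 of 9, with 1 false fire
```

Counting false fires separately from misses matters here, because a skill that fires too eagerly is the failure the near-miss negatives exist to catch.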

Behavioral tests: "is the discipline being followed?" Six dry-run cases, each hitting a specific way to break: the gate before reading (against "I'll just read it first"); silence below the threshold (against cargo cult — a 180-line log doesn't deserve a pipeline); grep before fan-out (efficiency); recursion as a loop, not a nested workflow() (a hard platform constraint — nesting deeper than one level is impossible); the facts contract; and the fallback for when the Workflow tool doesn't exist at all. Executors write an action plan, an independent grader checks it line by line against the assertions.

Live runs: "does the machinery actually work?" A dry run verifies the plan, not the execution. L1 — a synthetic log of 600,000 lines / 48.7 MB with 12 seeded "needles" whose positions are known to a deterministic generator: that yields measurable recall and precision instead of impressions. The three needle types are not random: 4 rare PANICs (anomaly search), 3 suspicious config overrides — and 5 steps of a single session (login → elevate-privileges → export-full-dump → wipe-audit-trail → logout), smeared across different ends of the file: that type tests synthesis — can the pipeline assemble a coherent story from chunks processed by different agents. L2 is not synthetic: a real repository and a real question — "what already works in the code and must not be touched?" — with spot-checked manual verification of the evidence paths.
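With needle positions known in advance, scoring a run is a set comparison. The sketch below shows the arithmetic under simple assumptions: `scoreRun` is a hypothetical helper, needles are identified by line number, and the sample line numbers are made up.

```javascript
// Hypothetical scorer: compares reported line numbers against seeded ones.
function scoreRun(seededLines, reportedLines) {
  const seeded = new Set(seededLines);
  const reported = new Set(reportedLines);
  const hits = [...reported].filter((n) => seeded.has(n)).length;
  return {
    recall: seeded.size ? hits / seeded.size : 1,        // share of needles found
    precision: reported.size ? hits / reported.size : 1, // share of reports that are needles
    missed: [...seeded].filter((n) => !reported.has(n)),
  };
}

// Made-up positions: three seeded needles, one missed, one false alarm.
console.log(scoreRun([120, 45_000, 599_990], [120, 45_000, 777]));
```

Returning the missed positions, not just the ratios, tells you which chunk and which agent to inspect.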

Chapter 4. Hardships. Three cases that made the feature better

Case 1. The baseline turned out too good

The unpleasant (and most useful) surprise of the RED measurement — a run of the same tasks without the skill, taken before the skill was written. We handed the agent that same 48.7-megabyte log — expecting to see an attempt to read it head-on. Instead, Sonnet coolly ran wc -l, looked at 20 lines of head and 21 of tail, walked through grep/awk statistics over levels and components — and found all 12 needles in 7 tool calls and ~37 thousand tokens. The baseline on mining facts from a familiar repo was strong too.

The honest conclusion had to be baked into both the report and the skill itself: on greppable needles in a homogeneous machine-readable log, grep-first is already in the models' blood — and the skill shouldn't fight that, it should legitimize it (the giant-file recipe, step 2: candidates fit → answer directly, no fan-out needed, and say so). The skill's real delta isn't "rescue from naive reading" but three other things: the contract (schemas, confidence labeling, placeholders, coverage — the baseline returned strong prose where facts from code and retold docs were indistinguishable), scalability to comprehension tasks with no grep anchors and corpora bigger than the window, and consistency — the discipline is pinned down by a regression net rather than depending on a given session's mood. If we hadn't taken the baseline, we'd be selling the skill for something it doesn't do.

Case 2. The loop that might never end, and the coverage that lied

A consolidated review found two defects in the canonical script, galling precisely for their subtlety. First: the reduce loop ("fold the findings until the aggregate gets small") had no termination guarantee — a reduce agent told to "deduplicate and drop the irrelevant" has every right to return exactly as many findings as it received, if they're all unique and relevant. The aggregate stops shrinking — the loop keeps spawning agents. The fix is a no-progress guard: a round didn't shrink the aggregate → exit with a log line; after the fix, termination is provable (the tracked size strictly decreases).
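The guard is easy to exercise in isolation. The sketch below mirrors the loop shape from the canonical script but is a standalone illustration: `foldUntilSmall` is a hypothetical name, and `reduceRound` is a synchronous stand-in for what would be awaited reduce agents in a real workflow.

```javascript
// Standalone illustration of the no-progress guard.
// reduceRound is a plain function here; in a real workflow it would call agents.
function foldUntilSmall(findings, reduceRound, limit = 30_000) {
  let working = findings;
  let prevSize = Infinity;
  let rounds = 0;
  while (JSON.stringify(working).length > limit) {
    const size = JSON.stringify(working).length;
    // A round that did not shrink the aggregate ends the loop.
    if (size >= prevSize) return { working, rounds, stopped: "no-progress" };
    prevSize = size;
    working = reduceRound(working);
    rounds++;
  }
  return { working, rounds, stopped: "small-enough" };
}

const items = Array.from({ length: 20 }, (_, i) => `finding-${i}`);

// A reducer that legitimately keeps everything: the guard stops after one round.
console.log(foldUntilSmall(items, (w) => w, 50).stopped);
// A reducer that halves the list converges on its own.
console.log(foldUntilSmall(items, (w) => w.slice(0, Math.ceil(w.length / 2)), 50).stopped);
```

The tracked size is a non-negative integer that must strictly decrease for the loop to continue, so the number of rounds is bounded.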

The second defect was worse, because it violated the skill's own rule. The script returned chunksProcessed: <all chunks> unconditionally — even if some map agents had crashed and nobody ever read their chunks. The very same "silent narrowing of coverage" from the prologue, now inside the tool that promised to fight it. The fix: the unfiltered results array keeps a null in place of every failed group — that's the only way to know what exactly went uncovered; the return became honest: {groupsSucceeded, failedGroups, …}, and the rule "failedGroups must make it into the final answer to the user" is written into the skill's text.
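The fix can be sketched as a pure function over the unfiltered results array. This is an illustration of the idea, not the canonical script: `summarizeCoverage` and `coverageLine` are hypothetical names, and the results array is made up.

```javascript
// Hypothetical coverage summary: a null (or undefined) entry marks a failed group,
// and its index is the only record of what went unread.
function summarizeCoverage(rawPartials) {
  const failedGroups = [];
  rawPartials.forEach((r, i) => {
    if (r == null) failedGroups.push(i);
  });
  return {
    groupsTotal: rawPartials.length,
    groupsSucceeded: rawPartials.length - failedGroups.length,
    failedGroups,
  };
}

// One sentence that belongs in the final answer to the user.
function coverageLine({ groupsTotal, groupsSucceeded, failedGroups }) {
  const base = `Coverage: ${groupsSucceeded}/${groupsTotal} groups processed`;
  return failedGroups.length === 0
    ? `${base}.`
    : `${base}; NOT covered: groups ${failedGroups.join(", ")}.`;
}

// Made-up run: groups 1 and 3 crashed.
const coverage = summarizeCoverage([{ findings: [] }, null, { findings: [] }, null]);
console.log(coverageLine(coverage));
```

Filtering the nulls out before counting is what produced the original defect, so the summary is computed from the raw array.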

Same bucket, a lesson about reviewing the test harness: the reviewer executed the fixture generator on boundary inputs instead of just reading it. That's how we found an infinite loop on a non-numeric argument (when one operand isn't numeric, POSIX awk falls back to comparing both as strings, character by character, so the loop condition stayed true forever; 76 MB of garbage in 3 seconds) and a silent loss of a needle at small sizes (the position was truncated into a nonexistent line 0).
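Both generator defects are input-validation problems, and the guards are short. The real generator is a shell script, so the JavaScript below is only a transliteration of the idea under stated assumptions: `parseLineCount` and `needleLine` are hypothetical names, and clamping to line 1 is one possible fix, not necessarily the one in the repository.

```javascript
// Hypothetical guard 1: reject anything that is not a positive integer
// before it can reach a loop condition.
function parseLineCount(raw) {
  if (!/^[1-9][0-9]*$/.test(String(raw))) {
    throw new Error(`line count must be a positive integer, got: ${raw}`);
  }
  return Number(raw);
}

// Hypothetical guard 2: a needle position derived from a fraction of the file
// must land on an existing line, never on line 0.
function needleLine(totalLines, fraction) {
  const line = Math.floor(totalLines * fraction);
  return Math.min(totalLines, Math.max(1, line));
}

console.log(parseLineCount("600000")); // 600000
console.log(needleLine(5, 0.1));       // 1, where plain truncation would give 0
```

Running these on boundary inputs (a word instead of a number, a five-line fixture) is the same kind of check the reviewer did by hand.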

Case 3. Fourteen milliseconds

The first live L1 run finished in 14 ms with chunksTotal: 0. Not a single agent started. A minimal workflow probe showed the cause: args arrives in the script as a JSON string, not an object — even if you pass an object. args.count was undefined, the chunk-list-building loop never executed once, and the script honestly returned emptiness.

Neither a dry run nor a code review would have caught this — the defect lives on the boundary between the tool and the script and shows up only on a real invocation. The fix — one line of defensive parsing (const A = typeof args === 'string' ? JSON.parse(args) : args) — went into the skill's canonical script and into the plan, marked "verified by live run." That's the best argument for why acceptance was live: its job isn't to confirm everything is fine, but to find what only reality can find.
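The failure and the fix can be reproduced without the Workflow tool at all. In the sketch below only the one-line parse and the `count` field come from the text above. The two function names and the chunk naming are hypothetical.

```javascript
// Naive version: trusts that args is an object.
function buildChunkListNaive(args) {
  const chunks = [];
  // When args is a string, args.count is undefined and the loop body never runs.
  for (let i = 0; i < args.count; i++) chunks.push(`chunk-${i}`);
  return chunks;
}

// Defensive version: accepts either a JSON string or an object.
function buildChunkList(args) {
  const A = typeof args === "string" ? JSON.parse(args) : args;
  const chunks = [];
  for (let i = 0; i < A.count; i++) chunks.push(`chunk-${i}`);
  return chunks;
}

const asDelivered = JSON.stringify({ count: 3 }); // what the script actually received
console.log(buildChunkListNaive(asDelivered).length); // 0
console.log(buildChunkList(asDelivered).length);      // 3
console.log(buildChunkList({ count: 3 }).length);     // 3
```

Nothing throws in the naive version, which is why the run "succeeded" with zero chunks instead of failing loudly.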

A bonus of the same kind: the sub-agents executing the behavioral tests physically have no Workflow tool — two of them discovered this themselves via tool search and correctly switched to the documented fallback (parallel plain agents, same discipline). The skill's fallback branch got a live check by accident — but a genuine one.

Chapter 5. Before → after

| | Before (RED, no skill) | After (GREEN + LIVE, with the skill) |
|---|---|---|
| A huge file | no discipline: whatever the session feels like; coverage never stated | size gate before reading; grep-first legitimized; fan-out of 60 agents when needed; recall 12/12, precision 12/12. The baseline also found 12/12: the delta isn't search accuracy but the explicitly stated 60/60 group coverage and consistency (see Case 1) |
| Facts from code | strong prose: facts from code and retold docs indistinguishable, holes invisible | 40 facts, 100% with a path and line numbers, verified/inferred separated, 2 holes marked as explicit `[fact: …]`; a 5/5 spot check confirmed verbatim |
| Automatic skill selection | n/a | 60/60 judge votes (positives, near-miss negatives, cross-regression of the neighbor; 0 false fires) |
| Pipeline guarantees | n/a (no pipeline existed without the skill); in the pre-review draft of the skill: the loop could fail to terminate, coverage could lie | termination provable; `failedGroups` in the return; the `args` string defused |
| Main agent's window | ~37k tokens of probes and grep statistics pass through the window (that's how the baseline went) | L1: six probe lines passed through the window; the 48.7-megabyte file itself never did |

A byproduct of L2 that nobody ordered: fact mining surfaced a real question for the product — the UI writes a dependsOn dependency straight into the local store, bypassing the HTTP route with validation. We filed a separate verification task for it. A good facts pipeline finds more than what it was asked.

Chapter 6. And efficiency?

Finding the hard stuff is half the job; the point is not to pay the pipeline tax on everything. Efficiency here is designed in, and the tests check exactly that.

The skill knows how to stay silent. Below the threshold (the "180-line log" test) — plain reading, zero machinery. The negative trigger cases and the cross-regression confirm across all checked scenarios (42/42 votes "don't fire") that the pipeline doesn't start on tasks where it's overhead.

Grep before fan-out. A needle task with a known anchor is solved by cutting out regions without a single sub-agent — and the agent must explicitly say the fan-out wasn't needed (test 3).

A small corpus — no Workflow. The requirements-mining recipe outright forbids the cannon-for-sparrows move: ≤30 relevant files → one agent with the list. Live L2 went exactly that way: 1 agent, ~74k tokens, 36 calls — 40 verifiable facts on the output.

The cost of a full fan-out is known and paid deliberately. L1 with 60 agents cost ~1.22M sub-tokens and ~2.5 minutes — on that task grep would have been ~33 times cheaper (~37k tokens, Case 1). We know this and don't hide it: on greppable tasks the skill prescribes grep and doesn't start the pipeline, and the expensive run was a one-off acceptance of the machinery — which paid for itself with the very first bug it found (Case 3). Fan-out is a tool for cases where grep doesn't work in principle (prose, transcripts, comprehension questions), not a tax on every big file. And the sub-tokens are spend from isolated windows; the main session's window stayed clean — often the scarcest currency of all.
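The ratio quoted above is plain arithmetic on the two measured figures. The per-agent average below is derived from the same totals and is not a separately measured number.

```javascript
// Approximate figures quoted in this post; the derived values are simple division.
const fanOutSubTokens = 1_220_000; // L1 fan-out, all sub-agents together
const grepBaselineTokens = 37_000; // baseline run from Case 1
const agentCount = 60;

const costRatio = fanOutSubTokens / grepBaselineTokens;
const tokensPerAgent = fanOutSubTokens / agentCount;

console.log(costRatio.toFixed(1));       // "33.0"
console.log(Math.round(tokensPerAgent)); // 20333
```

On average each sub-agent spent about as much as a modest single-file session, but none of that spend touched the main window.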

Budget and depth. The canonical script respects budget as a cap on fold depth and logs early stops. In fairness: this branch was checked only by the dry-run folding case — there was no live run with a real budget stop.

Epilogue. What honestly remains unproven

The 256 KB / 5000 line threshold is a starting default, not a hard-won number: tune it with experience. Two scenarios haven't been checked live — they're the best candidates for the next eval round: a full repo audit (multi-modal sweep + "dig until two consecutive rounds come up dry") on a large unfamiliar repository, and a prose transcript with no grep anchors: that's exactly where, per the RLM paper, the biggest lead over baseline is expected, and exactly where an honest baseline would show a real failure without the skill. And the discovered question about the double write path for dependsOn awaits its verification too.

How to try it

The skill ships in the my-architect plugin starting from v1.13.0 and turns itself on when a task smells of a big corpus — no need to call it by hand (that was the whole point of the trigger tests).

Create an account at my-architect.app, grab a token on the API Keys page, and export it in the same session where you run Claude Code:

```bash
export MCP_API_KEY=mcp_YOUR_TOKEN
```

Add the marketplace and install the plugin:

```
/plugin marketplace add d7561985/my-architect-marketplace
/plugin install my-architect@my-architect-marketplace
```

Already have the plugin? /plugin marketplace update my-architect-marketplace → /plugin update my-architect.

---

Facts and links

The skill: plugins/my-architect/skills/recursive-context/ in the open repository github.com/d7561985/my-architect-marketplace, plugin v1.13.0, tag my-architect--v1.13.0, release commit b39cfc1.
Full run data: skills/recursive-context/evals/RESULTS.md (RED / GREEN / LIVE); datasets trigger-evals.json, behavior-evals.json; fixture generator fixtures/gen-fixture.sh (deterministic, 12 needles).
The research: Recursive Language Models, arXiv:2512.24601 (MIT OASYS, December 2025); repository github.com/alexzhang13/rlm.
Key fix commits from the review and the live runs: be0b99d (generator guard), ab505f5 (reduce termination + honest coverage), c512e64 (defensive args parse).