Recursive Language Models: Why Your AI Agent Forgets
Your AI coding assistant handles the first three files flawlessly. By the twelfth file, it starts missing imports. By the twentieth, it is hallucinating function names that never existed. You may be seeing context rot: the decline in recall and reasoning quality as the amount of input grows. A larger context window can delay the problem, but it does not automatically solve it. Researchers at MIT propose a different approach: let the model manage its own context instead of trying to absorb everything at once.
The Problem Isn't the Window — It's the Approach
Many long-context language model workflows can suffer from context rot. As you feed more tokens into a model's context window (the text it can process at one time), its ability to accurately recall and reason over that information degrades. This isn't a fringe phenomenon — Anthropic documents it explicitly, and any engineer running long Claude Code sessions or extended ChatGPT conversations has experienced it firsthand.
The industry's response has been to expand context windows: from 4,000 tokens to 128,000, then to 1 million and beyond. But larger windows often delay the problem rather than solving it. Compute cost grows at least linearly with input length, and quadratically for standard self-attention, while the underlying degradation curve remains. At some point, you're paying for tokens the model can't effectively use.
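One common way to check whether context rot affects a given model is a needle-in-a-haystack probe: hide one fact in growing amounts of filler and track how often the model retrieves it. The sketch below is illustrative only. The names `build_haystack`, `recall_by_length`, and `forgetful_stub` are hypothetical, and `ask_model` stands for whatever function wraps your real model call.

```python
import random
from typing import Callable

FILLER = "The sky was a uniform grey that afternoon."


def build_haystack(needle: str, n_filler: int, position: float) -> str:
    """Return n_filler filler sentences with the needle inserted at a
    relative position (0.0 = start, just under 1.0 = near the end)."""
    sentences = [FILLER] * n_filler
    sentences.insert(int(position * n_filler), needle)
    return " ".join(sentences)


def recall_by_length(
    ask_model: Callable[[str], str],
    needle: str,
    question: str,
    answer: str,
    lengths: list[int],
    trials: int = 5,
    seed: int = 0,
) -> dict[int, float]:
    """Map each haystack size to the fraction of trials in which the
    model's reply contained the expected answer."""
    rng = random.Random(seed)  # fixed seed keeps needle positions repeatable
    results = {}
    for n_filler in lengths:
        hits = 0
        for _ in range(trials):
            prompt = build_haystack(needle, n_filler, rng.random())
            prompt += "\n\nQuestion: " + question
            if answer in ask_model(prompt):
                hits += 1
        results[n_filler] = hits / trials
    return results


def forgetful_stub(prompt: str) -> str:
    """Assumption: a fake model that 'forgets' once the prompt gets long.
    Replace with a call to the model you actually want to probe."""
    return "7421" if len(prompt) < 2000 else "I don't know"


results = recall_by_length(
    forgetful_stub,
    needle="The vault code is 7421.",
    question="What is the vault code?",
    answer="7421",
    lengths=[10, 1000],
)
print(results)
```

With a real model behind `ask_model`, a recall rate that falls as the haystack grows is the signature of context rot. A substring match is a crude scorer, so treat the numbers as a rough signal.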
How Recursive Language Models Work
Researchers at MIT — Alex L. Zhang, Tim Kraska, and Omar Khattab — propose a different architecture. Instead of feeding all context into one prompt, a Recursive Language Model (RLM) gives the model a Python programming environment where the full context is stored as a variable. The model — called the "root LM" — receives only the user's question. It then writes code to explore, filter, and process the context at its own pace. It can peek at specific portions, search for keywords, split data into chunks, and even spawn smaller AI models to handle individual pieces.
From the outside, it looks like a single API call. Under the hood, the model is orchestrating its own research process.
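The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_rlm`, the `llm_query` helper, and the convention of signaling completion by assigning a variable named `FINAL` are all hypothetical, and both models are replaced by scripted stubs so the example runs offline.

```python
import contextlib
import io
from typing import Callable


def run_rlm(
    question: str,
    context: str,
    root_lm: Callable[[str], str],
    sub_lm: Callable[[str], str],
    max_steps: int = 8,
) -> str:
    """Let root_lm explore `context` by writing code instead of reading it."""
    # The full context lives only inside the REPL namespace.
    env = {"context": context, "llm_query": sub_lm}
    # The root model sees the question and metadata, never the raw context.
    transcript = (
        f"Question: {question}\n"
        f"A variable `context` ({len(context)} chars) is loaded in the REPL."
    )
    for _ in range(max_steps):
        code = root_lm(transcript)
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            # WARNING: exec runs arbitrary code; use a sandbox in real systems.
            exec(code, env)
        if "FINAL" in env:  # hypothetical convention for "answer is ready"
            return str(env["FINAL"])
        # Feed back only a truncated view of the output to keep the prompt small.
        transcript += f"\n>>> {code}\n{buffer.getvalue()[:500]}"
    return "no answer within step budget"


def make_scripted_root_lm() -> Callable[[str], str]:
    """Stub standing in for the root LM: replays three fixed code steps."""
    steps = iter([
        "print(len(context))",  # peek at the size
        "hits = [l for l in context.splitlines() if 'deploy key' in l]\nprint(hits)",
        "FINAL = llm_query('Extract the key from: ' + hits[0])",  # delegate
    ])
    return lambda transcript: next(steps)


def toy_sub_lm(prompt: str) -> str:
    """Stub standing in for a smaller sub-model: returns the last word."""
    return prompt.rsplit(" ", 1)[-1]


lines = [f"noise line {i}" for i in range(10_000)]
lines.insert(7_000, "the deploy key is K-42")
answer = run_rlm(
    "What is the deploy key?", "\n".join(lines), make_scripted_root_lm(), toy_sub_lm
)
print(answer)  # K-42
```

The point of the design is visible in `transcript`: the root model's prompt stays small no matter how large `context` grows, because it only ever sees the code it wrote and a truncated slice of the output.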
Agents are designed based on human intuition on how to break down a problem. RLMs are designed based on the principle that fundamentally, language models should decide how to break down a problem. — Alex Zhang
This is the key distinction. Current agent frameworks — ReAct, CodeAct, Claude Code — prescribe how a model should decompose tasks. RLMs defer that decision to the model itself. Whether that's genuinely better depends on the task.
The Performance Case Is Strong — With Caveats
The numbers reported by the authors in the research paper are worth attention, but they should be read as early research results, not as settled production benchmarks.
In the initial OOLONG experiments, an RLM using GPT-5-mini outperformed GPT-5 by more than 2x on a 132k-token split while staying in a similar cost range. The paper also reports strong behavior at 10M+ tokens, where the model interacts with context through a REPL rather than ingesting it directly.
Across four benchmarks, the paper reports that RLMs built on GPT-5 outperformed standard compaction methods by a median of 26% and Claude Code by 13%. The first model explicitly trained for this pattern, RLM-Qwen3-8B, built on an open-weight 8-billion-parameter base, outperformed its underlying model by 28% and approached full GPT-5 quality on three tasks.
But these results come with honest limitations. On mathematical reasoning tasks, the RLM framework made models worse — the recursive scaffolding adds complexity that hurts when the base model can already solve the problem directly. A single query can take seconds to minutes instead of milliseconds. And today's models weren't trained for this pattern; they figure it out on the fly, which means they sometimes struggle with the scaffolding.
For teams evaluating AI infrastructure, the decision point is specific: if your primary use case involves long-context processing — document analysis, codebase-wide refactoring, multi-source research — RLMs deserve a pilot. If your tasks are short-context and the model already handles them well, the added latency buys you nothing.
The Industry Is Moving Fast
Framework adoption signals genuine interest, not just academic curiosity:
DSPy now includes an experimental RLM module for developers.
A Google Developer Forums community article shows how the RLM pattern can be implemented with ADK.
Prime Intellect has implemented RLMs in its training framework and describes reinforcement-learning-trained context management as a research direction.
Daytona documents sandbox-based RLM implementations.
Teaching models to manage their own context end-to-end through reinforcement learning will be the next major breakthrough, enabling agents to solve long-horizon tasks spanning weeks to months. — Prime Intellect
What now?
RLMs solve a real, expensive problem, but the technology is in early stages. Here's what to do this week:
If you're running long-context AI agents in production, benchmark your current context management approach (compaction, summarization, retrieval) against the open-source RLM implementation on your actual workloads. A focused proof of concept should quickly show whether the pattern fits your workload.
If you're evaluating AI infrastructure for the next 12 months, include RLM-readiness as a criterion. Ask vendors whether their agent framework supports recursive context management and whether they plan to offer RLM-optimized models.
Start with the DSPy RLM module — it's the most accessible implementation today and works with any OpenAI-compatible API. Build a small proof-of-concept on one long-context task before investing further.
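A benchmark like the one suggested above does not need much machinery. The sketch below is one possible shape for it, assuming each approach can be wrapped as a function that takes a context and a question and returns a string. The names `benchmark`, `first_window_only`, and `full_scan` are hypothetical, and the two strategies are toy stand-ins rather than real compaction or RLM pipelines.

```python
import time
from typing import Callable

Strategy = Callable[[str, str], str]    # (context, question) -> answer text
Task = tuple[str, str, str]             # (context, question, expected)


def benchmark(
    strategies: dict[str, Strategy], tasks: list[Task]
) -> dict[str, dict[str, float]]:
    """Run every strategy on every task; report accuracy and mean latency."""
    report = {}
    for name, answer_fn in strategies.items():
        hits = 0
        total_seconds = 0.0
        for context, question, expected in tasks:
            start = time.perf_counter()
            output = answer_fn(context, question)
            total_seconds += time.perf_counter() - start
            # Crude scoring: does the expected string appear in the output?
            if expected.lower() in output.lower():
                hits += 1
        report[name] = {
            "accuracy": hits / len(tasks),
            "mean_latency_s": total_seconds / len(tasks),
        }
    return report


def first_window_only(context: str, question: str) -> str:
    """Toy stand-in for a pipeline that only sees the first 200 characters."""
    return context[:200]


def full_scan(context: str, question: str) -> str:
    """Toy stand-in for a pipeline that can search the whole context."""
    return "\n".join(l for l in context.splitlines() if "invoice" in l)


tasks = [
    ("filler\n" * 100 + "invoice total: 4417 EUR", "What is the invoice total?", "4417"),
]
report = benchmark(
    {"first_window_only": first_window_only, "full_scan": full_scan}, tasks
)
for name, stats in report.items():
    print(name, stats)
```

In a real comparison, the two stubs would be replaced by wrappers around your current context management pipeline and an RLM implementation, and `tasks` would come from your own workload. Tracking latency alongside accuracy matters here, since the added latency is one of the main costs of the recursive approach.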
RLMs Are Early, but Promising for Long-Context Agents
RLMs are one of the most promising architectures for long-context agent work, but they are still early. Framework support is emerging; robust, broadly available RLM-optimized models are not yet the default. For production teams, the right move is not migration. It is a focused benchmark on one real long-context workload.