Paper (arXiv) | ICLR 2026 | Stanford, SambaNova, Samsung, Meta
The core problem: LLMs suffer from brevity bias (dropping domain insights to stay concise) and context collapse (iterative rewrites eroding information quality over time). Dynamic Cheatsheet, the prior state of the art, demonstrated this collapse dramatically: a context shrank from 18,282 tokens to 122 tokens, and accuracy cratered.
ACE’s fix
Treat contexts as evolving playbooks, not static prompts. Three specialized agents cycle through the work:
- Generator produces reasoning trajectories for new queries
- Reflector extracts concrete insights from successes and failures
- Curator integrates insights into structured context updates
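The three-agent cycle can be sketched roughly as follows. This is an illustrative outline, not the paper's implementation; `llm` stands in for any completion function, and the prompts are hypothetical placeholders.

```python
def ace_cycle(query: str, context: str, llm) -> tuple:
    """One ACE adaptation cycle, sketched with hypothetical prompts.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    # Generator: produce a reasoning trajectory using the current playbook.
    trajectory = llm(
        f"Playbook:\n{context}\n\nQuery: {query}\nReason step by step."
    )
    # Reflector: extract concrete insights from the success or failure.
    insights = llm(
        f"Trajectory:\n{trajectory}\n\nList concrete lessons, one per line."
    )
    # Curator: turn insights into a structured delta against the context.
    delta = llm(
        f"Playbook:\n{context}\n\nInsights:\n{insights}\n\n"
        "Emit ADD/UPDATE bullet operations."
    )
    return trajectory, delta
```

The key design point is that the Curator emits a delta, not a rewritten context; merging that delta is a separate, deterministic step.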
ACE represents knowledge as structured, itemized bullets with metadata (identifiers, helpfulness counters). Instead of rewriting the full context each cycle, it applies incremental delta updates. Only relevant bullets change. Semantic embeddings handle deduplication.
Three properties fall out of this structure: localization (edit only what changed), fine-grained retrieval (generator pulls pertinent knowledge), and incremental adaptation (merge, prune, deduplicate without full rewrites).
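A minimal sketch of the bullet representation and localized delta application, under the assumption that the context is a map from bullet IDs to entries and that deltas are lists of ADD/UPDATE/REMOVE operations (the exact schema in the paper may differ):

```python
from dataclasses import dataclass, field
import itertools

_next_id = itertools.count()

@dataclass
class Bullet:
    """One itemized knowledge entry with metadata."""
    text: str
    bullet_id: int = field(default_factory=lambda: next(_next_id))
    helpful: int = 0  # helpfulness counter, bumped on positive feedback
    harmful: int = 0

def apply_delta(context: dict, delta: list) -> None:
    """Apply a localized delta; only the referenced bullets change."""
    for op, payload in delta:
        if op == "ADD":
            bullet = Bullet(payload)
            context[bullet.bullet_id] = bullet
        elif op == "UPDATE":
            bullet_id, vote = payload
            if vote > 0:
                context[bullet_id].helpful += 1
            else:
                context[bullet_id].harmful += 1
        elif op == "REMOVE":
            context.pop(payload, None)
```

Because each operation names a specific bullet, edits stay localized and independent deltas touching different bullets can be merged without conflicts.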
Grow-and-refine mechanism
New bullets are appended; existing bullets are updated in place, with their counters incremented. The system runs deduplication either proactively (after each delta) or lazily (when the context window fills up), trading accuracy against latency.
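The dedup-timing choice can be sketched as below. The `embed` argument stands in for a real sentence-embedding model (the paper does not specify one), and the cosine threshold is an illustrative value:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def deduplicate(bullets, embed, threshold=0.9):
    """Keep only bullets whose embedding is not a near-duplicate
    of an earlier bullet's embedding."""
    kept, vectors = [], []
    for bullet in bullets:
        vec = embed(bullet)
        if all(cosine(vec, v) < threshold for v in vectors):
            kept.append(bullet)
            vectors.append(vec)
    return kept

def maybe_dedup(bullets, embed, token_count, window, lazy=True):
    """Proactive mode dedups after every delta; lazy mode waits
    until the context approaches the window limit."""
    if not lazy or token_count > window:
        return deduplicate(bullets, embed)
    return bullets
```

Lazy mode keeps per-update latency low; proactive mode keeps the context tight at the cost of running the embedding comparison on every cycle.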
Offline vs. online
Offline: Multi-epoch adaptation on training splits. ACE revisits queries to progressively strengthen contexts. Evaluated with pass@1 accuracy on test splits.
Online: Sequential evaluation on test splits. For each sample, predict with current context, then update. Localized deltas merge in parallel, making batch adaptation efficient.
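The two regimes differ only in when the context update happens relative to evaluation. A minimal sketch, where `cycle` is any context-update function and `predict` is any inference function (both hypothetical signatures):

```python
def adapt_offline(samples, context, cycle, epochs=3):
    """Offline: revisit the training split for several epochs,
    progressively strengthening the context before any test query."""
    for _ in range(epochs):
        for query, label in samples:
            context = cycle(query, context, label)
    return context

def adapt_online(stream, context, predict, cycle):
    """Online: for each test sample, predict with the current
    context first, then update the context from the outcome."""
    predictions = []
    for query, label in stream:
        predictions.append(predict(query, context))  # evaluate first
        context = cycle(query, context, label)       # then adapt
    return predictions, context
```

In the online setting, the prediction for sample *i* never sees information from sample *i* itself, which is what makes sequential evaluation fair.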
Results
| Task | Improvement |
|---|---|
| AppWorld (agent benchmark) | +10.6% |
| Finance (FiNER) | +8.6% avg, +18.0% on numerical reasoning |
| Medical (DDXPlus) | +15.0% |
| SQL (BIRD-SQL) | +5.1% |
On AppWorld, ACE matched the performance of a production GPT-4.1 agent while using a smaller DeepSeek model, and it cut adaptation latency by 82-91% compared to baselines like GEPA and Dynamic Cheatsheet.
Why this matters for Context Engineering
Most context engineering today is manual. You write CLAUDE.md files, design skills, configure sub-agents. ACE points toward automated context evolution: systems that learn which context helps and restructure themselves accordingly.
A principle that echoes across Harness Engineering: LLMs perform better with long, detailed contexts than with compressed summaries. Preserve the information and let the model decide what is relevant, rather than pre-compressing on its behalf.
ACE works best for tasks demanding detailed domain knowledge and complex tool use. Simpler tasks can get away with concise instructions. The gap between those two regimes is where most teams misjudge their context strategy.