Paper (arXiv) | ICLR 2026 | Stanford, SambaNova, Samsung, Meta
The core problem: LLMs suffer from brevity bias (dropping domain insights to stay concise) and context collapse (iterative rewrites eroding information quality over time). Dynamic Cheatsheet, the prior state of the art, demonstrated this collapse dramatically: a context shrank from 18,282 tokens to 122 tokens, and accuracy cratered.
ACE’s fix
Treat contexts as evolving playbooks, not static prompts. Three specialized agents cycle through the work:
- Generator produces reasoning trajectories for new queries
- Reflector extracts concrete insights from successes and failures
- Curator integrates insights into structured context updates
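The three-agent cycle can be sketched roughly as follows. This is an illustrative outline, not the paper's implementation; `llm` stands in for any completion function, and the prompts are hypothetical placeholders.

```python
def ace_cycle(query: str, context: str, llm) -> tuple:
    """One ACE adaptation cycle, sketched with hypothetical prompts.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    # Generator: produce a reasoning trajectory using the current playbook.
    trajectory = llm(
        f"Playbook:\n{context}\n\nQuery: {query}\nReason step by step."
    )
    # Reflector: extract concrete insights from the success or failure.
    insights = llm(
        f"Trajectory:\n{trajectory}\n\nList concrete lessons, one per line."
    )
    # Curator: turn insights into a structured delta against the context.
    delta = llm(
        f"Playbook:\n{context}\n\nInsights:\n{insights}\n\n"
        "Emit ADD/UPDATE bullet operations."
    )
    return trajectory, delta
```

The key design point is that the Curator emits a delta, not a rewritten context; merging that delta is a separate, deterministic step.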
ACE represents knowledge as structured, itemized bullets with metadata (identifiers, helpfulness counters). Instead of rewriting the full context each cycle, it applies incremental delta updates. Only relevant bullets change. Semantic embeddings handle deduplication.
Three properties fall out of this structure: localization (edit only what changed), fine-grained retrieval (generator pulls pertinent knowledge), and incremental adaptation (merge, prune, deduplicate without full rewrites).
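A minimal sketch of the bullet representation and localized delta application, under the assumption that the context is a map from bullet IDs to entries and that deltas are lists of ADD/UPDATE/REMOVE operations (the exact schema in the paper may differ):

```python
from dataclasses import dataclass, field
import itertools

_next_id = itertools.count()

@dataclass
class Bullet:
    """One itemized knowledge entry with metadata."""
    text: str
    bullet_id: int = field(default_factory=lambda: next(_next_id))
    helpful: int = 0  # helpfulness counter, bumped on positive feedback
    harmful: int = 0

def apply_delta(context: dict, delta: list) -> None:
    """Apply a localized delta; only the referenced bullets change."""
    for op, payload in delta:
        if op == "ADD":
            bullet = Bullet(payload)
            context[bullet.bullet_id] = bullet
        elif op == "UPDATE":
            bullet_id, vote = payload
            if vote > 0:
                context[bullet_id].helpful += 1
            else:
                context[bullet_id].harmful += 1
        elif op == "REMOVE":
            context.pop(payload, None)
```

Because each operation names a specific bullet, edits stay localized and independent deltas touching different bullets can be merged without conflicts.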
Grow-and-refine mechanism
New bullets are appended; existing bullets are updated in place, with their counters incremented. The system runs deduplication either proactively (after each delta) or lazily (when the context window fills up), trading accuracy against latency.
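The dedup-timing choice can be sketched as below. The `embed` argument stands in for a real sentence-embedding model (the paper does not specify one), and the cosine threshold is an illustrative value:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def deduplicate(bullets, embed, threshold=0.9):
    """Keep only bullets whose embedding is not a near-duplicate
    of an earlier bullet's embedding."""
    kept, vectors = [], []
    for bullet in bullets:
        vec = embed(bullet)
        if all(cosine(vec, v) < threshold for v in vectors):
            kept.append(bullet)
            vectors.append(vec)
    return kept

def maybe_dedup(bullets, embed, token_count, window, lazy=True):
    """Proactive mode dedups after every delta; lazy mode waits
    until the context approaches the window limit."""
    if not lazy or token_count > window:
        return deduplicate(bullets, embed)
    return bullets
```

Lazy mode keeps per-update latency low; proactive mode keeps the context tight at the cost of running the embedding comparison on every cycle.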
Offline vs. online
Offline: Multi-epoch adaptation on training splits. ACE revisits queries to progressively strengthen contexts. Evaluated with pass@1 accuracy on test splits.
Online: Sequential evaluation on test splits. For each sample, predict with current context, then update. Localized deltas merge in parallel, making batch adaptation efficient.
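The two regimes differ only in when the context update happens relative to evaluation. A minimal sketch, where `cycle` is any context-update function and `predict` is any inference function (both hypothetical signatures):

```python
def adapt_offline(samples, context, cycle, epochs=3):
    """Offline: revisit the training split for several epochs,
    progressively strengthening the context before any test query."""
    for _ in range(epochs):
        for query, label in samples:
            context = cycle(query, context, label)
    return context

def adapt_online(stream, context, predict, cycle):
    """Online: for each test sample, predict with the current
    context first, then update the context from the outcome."""
    predictions = []
    for query, label in stream:
        predictions.append(predict(query, context))  # evaluate first
        context = cycle(query, context, label)       # then adapt
    return predictions, context
```

In the online setting, the prediction for sample *i* never sees information from sample *i* itself, which is what makes sequential evaluation fair.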
Results
| Task | Improvement |
|---|---|
| AppWorld (agent benchmark) | +10.6% |
| Finance (FiNER) | +8.6% avg, +18.0% on numerical reasoning |
| Medical (DDXPlus) | +15.0% |
| SQL (BIRD-SQL) | +5.1% |
On AppWorld, ACE matched the performance of a production GPT-4.1 agent while using a smaller DeepSeek model, and it cut adaptation latency by 82-91% compared to baselines like GEPA and Dynamic Cheatsheet.
Why this matters for Context Engineering
Most context engineering today is manual. You write CLAUDE.md files, design skills, configure sub-agents. ACE points toward automated context evolution: systems that learn which context helps and restructure themselves accordingly.
A principle that echoes across Harness Engineering: LLMs perform better with long, detailed contexts than with compressed summaries. Preserve the information and let the model decide what is relevant, rather than pre-compressing on its behalf.
ACE works best for tasks demanding detailed domain knowledge and complex tool use. Simpler tasks can get away with concise instructions. The gap between those two regimes is where most teams misjudge their context strategy.