Prompt engineering asks “how do I phrase this question?” Context engineering asks “what does the model need to know before I ask?” The second question determines whether production systems work.

Andrej Karpathy defined it as “the delicate art and science of filling the context window with just the right information for the next step.” Tobi Lütke of Shopify pushed the term into mainstream use: “the art of providing all the context for the task to be plausibly solvable by the LLM.” Both identified the same gap. People associate “prompts” with short instructions typed into a chat box. Production AI systems require structured information architecture.

Why the shift happened

Over 70% of errors in production LLM applications trace back to incomplete, irrelevant, or poorly structured context. The models got good enough. The bottleneck moved to the context side.

LLMs have a finite attention budget. Every token in the context window competes for that attention. As context grows, precision drops, reasoning weakens, and the model starts missing things. Researchers call this “context rot”: performance degrades unpredictably as input expands. Context engineers maximize signal density within that budget.

The components

Karpathy breaks context engineering into concrete pieces:

  • Task descriptions and explanations: What you want done and why
  • Few-shot examples: Show, don’t tell
  • RAG (retrieval): Pull in relevant documents, code, data on demand
  • Multimodal data: Images, structured data, API responses
  • Tools: Capabilities the model can invoke (each tool description eats tokens)
  • State and history: Conversation memory, prior actions, intermediate results
  • Compacting: Compression and summarization to fit the budget

You choose what goes in. Everything else stays out.
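That selection discipline can be sketched as a greedy fill: components enter the window in priority order until the budget is spent, and everything else stays out. This is a minimal illustration, not anyone's production assembler; token counts are approximated by word counts, and the component names, priorities, and budget are invented.

```python
def assemble_context(components, budget):
    """Add components in priority order until the token budget is spent."""
    context, used = [], 0
    for priority, name, text in sorted(components):
        cost = len(text.split())          # crude stand-in for a tokenizer
        if used + cost > budget:
            continue                      # leave it out: everything else stays out
        context.append(text)
        used += cost
    return "\n\n".join(context), used

# Illustrative components, lowest priority number = most important
components = [
    (0, "task", "Summarize the attached incident report and propose a fix."),
    (1, "examples", "Example: for outage X the fix was a config rollback."),
    (2, "history", "Prior step: the agent listed the failing services."),
    (3, "raw_logs", "... thousands of log lines we can afford to drop ..."),
]

ctx, used = assemble_context(components, budget=30)
```

The raw logs never make it in: the budget runs out first, which is the point.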

Five operations

Selection, compression, ordering, isolation, and format optimization form the practical toolkit:

Selection: Pick what belongs. RAG retrieves documents. Dynamic tool selection loads only the 5-8 relevant tools instead of all 50; LangChain reports a 3x improvement in tool-selection accuracy from dynamic loading over static.
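The dynamic tool loading above can be sketched as a relevance filter over tool descriptions. Real systems score by embedding similarity; plain keyword overlap stands in here so the sketch stays runnable, and all tool names and descriptions are made up.

```python
def select_tools(task, tools, k=3):
    """Return the k tool names whose descriptions best match the task."""
    task_words = set(task.lower().split())
    scored = sorted(
        tools.items(),
        key=lambda kv: -len(task_words & set(kv[1].lower().split())),
    )
    return [name for name, _ in scored[:k]]

# Illustrative tool registry; only the relevant subset enters the context
tools = {
    "read_file": "read a file from disk and return its contents",
    "write_file": "write contents to a file on disk",
    "run_tests": "run the test suite and return failures",
    "send_email": "send an email to a recipient",
    "query_db": "run a sql query against the database",
}

selected = select_tools("read the config file and run the test suite", tools, k=2)
```

Only the two matching tool descriptions would be loaded; the other 48 (here, three) never spend a token.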

Compression: Squeeze more signal into fewer tokens. Summarize history. Chunk documents at semantic boundaries with 10-15% overlap. Strip metadata that adds noise.
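The chunking guideline above can be sketched as a sliding window. Word counts stand in for tokens, splitting is positional rather than truly semantic, and the 12% overlap is one point inside the 10-15% range; all parameters are illustrative.

```python
def chunk_words(words, chunk_size=100, overlap=0.12):
    """Sliding-window chunks; consecutive chunks share ~overlap of their words."""
    step = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break                      # last window reached the end
    return chunks

doc = ("context engineering fills the window with the right information " * 20).split()
chunks = chunk_words(doc, chunk_size=50, overlap=0.12)
```

With a 50-word chunk and 12% overlap, each chunk shares its last 6 words with the start of the next, preserving continuity across boundaries.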

Ordering: Position matters. Models attend differently to the beginning and end of context versus the middle (“lost in the middle” problem). Put critical instructions and recent state where attention is strongest.
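One hedged way to act on that ordering rule: place the two highest-priority items in the strongest attention slots (start and end) and push everything else into the middle. The priority scheme here is invented for illustration.

```python
def order_for_attention(items):
    """items: list of (priority, text); lower priority number = more important."""
    ranked = sorted(items)                     # most important first
    head = [ranked[0][1]]                      # strongest slot: the start
    tail = [ranked[1][1]] if len(ranked) > 1 else []   # next strongest: the end
    middle = [text for _, text in ranked[2:]]  # everything else is buried
    return head + middle + tail

ordered = order_for_attention([
    (2, "background docs"),
    (0, "critical instruction"),
    (3, "old tool output"),
    (1, "recent state"),
])
```

The critical instruction lands first and recent state lands last, where attention is strongest.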

Isolation: Delegate subtasks to separate agents with clean context windows. A single agent handling code exploration, document parsing, and summary writing has a polluted context by the time it reaches the summary. Sub-agents solve this with context firewalls.
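The context firewall can be sketched in a few lines: the subtask runs against a fresh window, and only its short result crosses back to the parent. `run_llm` is a placeholder assumption standing in for a real model call.

```python
def run_llm(context):
    # Placeholder: a real implementation would call a model here
    return f"summary of: {context[:40]}"

def run_subagent(task, inputs):
    context = f"{task}\n\n{inputs}"   # clean window: no parent history leaks in
    return run_llm(context)           # only the result leaks back out

parent_context = ["plan the refactor"]
result = run_subagent("explore the codebase", "src/ listing ...")
parent_context.append(result)  # parent sees the summary, not the exploration transcript
```

The parent's window grows by one summary line, not by the thousands of tokens the exploration itself consumed.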

Format optimization: Structure data so the model can parse it efficiently. JSON for structured data. Markdown headers for documents. Consistent schemas across tool outputs.
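A sketch of the consistent-schema idea: every tool returns the same JSON envelope, so the model learns to parse one shape instead of fifty. The field names here are assumptions, not any standard.

```python
import json

def envelope(tool, ok, data):
    """Wrap any tool output in one consistent, sorted-key JSON shape."""
    return json.dumps({"tool": tool, "ok": ok, "data": data}, sort_keys=True)

a = envelope("run_tests", True, {"failures": 0})
b = envelope("read_file", False, {"error": "not found"})
```

Success and failure cases share the same top-level keys, so downstream parsing never branches on shape.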

Memory tiers

Production agents run on layered memory:

  • Short-term: Current conversation, recent tool outputs
  • Working memory: Temporary state needed for a multi-step task (destination, dates, budget while booking a trip)
  • Long-term: Persistent facts, user preferences, learned patterns

A Cognitive Workspace study found 58.6% memory reuse for agents using structured state-based memory versus 0% for classical RAG. Agents that remember structured facts beat agents that re-retrieve everything from scratch.
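The three tiers can be sketched as one structure where the agent recalls structured state by key instead of re-retrieving it. The tier names mirror the list above; the contents and lookup policy are illustrative, not the Cognitive Workspace design.

```python
class AgentMemory:
    def __init__(self):
        self.short_term = []   # recent turns and tool outputs
        self.working = {}      # task state: destination, dates, budget
        self.long_term = {}    # persistent preferences and learned facts

    def recall(self, key):
        """Prefer structured working state, fall back to long-term memory."""
        if key in self.working:
            return self.working[key]
        return self.long_term.get(key)

mem = AgentMemory()
mem.long_term["seat_pref"] = "aisle"
mem.working["destination"] = "Lisbon"
```

A recall hit costs a dictionary lookup; a classical-RAG equivalent would re-run retrieval for the same fact on every step.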

Relationship to harness engineering

Context engineering is one layer of harness engineering. The harness includes tools, permissions, hooks, feedback loops, and orchestration. Context engineering focuses specifically on what goes into the context window and how it gets there.

Martin Fowler’s context engineering for coding agents covers the coding-specific angle: CLAUDE.md files, progressive skill disclosure, sub-agent isolation. His harness engineering framework wraps context engineering inside a broader system of guides (feedforward) and sensors (feedback).

Where RAG fits now

RAG started as “retrieve documents, stuff them in the prompt.” In 2026 it’s evolving into a context engine. Agents evaluate which data, tools, and memories matter for each reasoning step before retrieving anything.

Agentic RAG frameworks like Elysia use decision-tree architectures where a decision agent evaluates the environment, available tools, past actions, and future options before choosing what to retrieve. This moves retrieval from a static pipeline to an active reasoning step.
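A toy version of retrieval as a reasoning step: a decision function inspects the current state before acting, rather than retrieving unconditionally. The rules are invented for illustration and are not Elysia's actual decision tree.

```python
def decide(state):
    """Choose the next action from the current state, not from a fixed pipeline."""
    if not state["question_answered"] and not state["docs"]:
        return "retrieve"
    if state["docs"] and not state["question_answered"]:
        return "answer"
    return "stop"

state = {"question_answered": False, "docs": []}
actions = []
while (action := decide(state)) != "stop":
    actions.append(action)
    if action == "retrieve":
        state["docs"] = ["relevant doc"]   # stand-in for an actual retrieval call
    elif action == "answer":
        state["question_answered"] = True
```

If the state already held the needed docs, the decision agent would skip retrieval entirely; a static pipeline cannot.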

Karpathy’s Software 3.0 connection

Karpathy frames three eras of software: telling the machine how (1.0, traditional code), showing it what through examples (2.0, ML), telling it what you want (3.0, LLMs). Context engineering sits at the center of Software 3.0. The “program” is the context you construct. The LLM is the runtime.

The Ralph Wiggum Loop philosophy lands here too. You design the information environment where the agent writes code. You sit on the loop, watch it run, and tune the context when it drifts.

Automated context evolution

Most context engineering today is manual: writing CLAUDE.md files, designing skills, choosing what to retrieve. The ACE framework (ICLR 2026) points toward automated approaches. ACE treats contexts as structured, itemized playbooks that evolve through generate-reflect-curate cycles. Instead of rewriting the full context each iteration, it applies incremental delta updates: localized edits, deduplication via semantic embeddings, helpfulness counters on individual bullets. This cut adaptation latency by 82-91% over baselines and matched GPT-4.1 agent performance using smaller open-source models.
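A toy sketch of those generate-reflect-curate mechanics: bullets carry helpfulness counters and are edited by small deltas rather than rewritten wholesale. Exact-match dedup stands in for ACE's semantic-embedding dedup, and this API is invented for illustration, not ACE's actual interface.

```python
class Playbook:
    def __init__(self):
        self.bullets = {}   # bullet text -> helpfulness counter

    def apply_delta(self, add=(), helpful=(), prune_below=0):
        """Localized edits instead of a full rewrite."""
        for text in add:
            self.bullets.setdefault(text, 0)   # dedup: exact repeats are skipped
        for text in helpful:
            if text in self.bullets:
                self.bullets[text] += 1        # reflection marked this bullet useful
        self.bullets = {t: c for t, c in self.bullets.items() if c >= prune_below}

pb = Playbook()
pb.apply_delta(add=["prefer config rollback for outages", "always run tests"])
pb.apply_delta(add=["always run tests"], helpful=["always run tests"])
pb.apply_delta(prune_below=1)   # curate: drop bullets that never helped
```

Each delta touches a few entries; the untouched bullets persist unchanged, which is what makes adaptation cheap.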

ACE validated a counterintuitive finding: LLMs perform better with long, detailed contexts than compressed summaries. The model filters for relevance on its own. You preserve information. Let the model decide what matters.

Practical patterns for coding agents

  • CLAUDE.md files: Human-written, under 60 lines. ETH Zurich data shows LLM-generated agent files hurt performance while burning 20%+ more tokens. Treat them as a table of contents.
  • Skills (progressive disclosure): Agent sees names and descriptions at startup. Full content loads on demand. Keeps baseline context lean.
  • Dynamic tool loading: Expose tool descriptions only when relevant. HumanLayer replaced a bloated MCP server with a 6-command CLI wrapper, saving thousands of tokens.
  • Context resets: Anthropic’s three-agent pattern (planner/generator/evaluator) resets context between agents. A progress file plus git history lets the next agent reconstruct state fresh.
  • Back-pressure: Test output, build logs, and type checker results validate agent work. But 4,000 lines of passing tests can cause hallucination. Compress validation output.
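The progressive-disclosure pattern from the list above can be sketched as a two-stage registry: the startup context carries only skill names and one-line descriptions, and a skill's full body enters the window on demand. The skill contents here are placeholders.

```python
# name -> (one-line description, full body); bodies are illustrative filler
SKILLS = {
    "deploy": ("deploy the service to staging", "Step 1: build the image...\n" * 50),
    "migrate": ("run database migrations", "Step 1: snapshot the db...\n" * 50),
}

def startup_index():
    """What the agent sees at startup: names and descriptions only."""
    return "\n".join(f"{name}: {desc}" for name, (desc, _) in SKILLS.items())

def load_skill(name):
    """Only now does the full body enter the context window."""
    return SKILLS[name][1]

index = startup_index()
```

The index costs a few dozen tokens regardless of how large each skill body grows.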

What goes wrong

  • Stuffing the context window with “just in case” information (more tokens, worse performance)
  • LLM-generated context files that optimize for token count over information density
  • Loading all available tools when the agent needs three
  • Treating RAG as a pipeline instead of an active reasoning step
  • Ignoring the ordering problem (critical info buried in the middle)

Sources