Claude Code Series · Part 2/6 · 15 min read

Context Compression: 4 Levels to Never Lose Anything

Your conversation with Claude Code can last hours, but the context window has a physical limit. Here's how a 4-tier compression system, a 9-section summary, and a "sacred section" guarantee that none of your instructions are ever lost.

Gonzalo Monzón
[Image: Abstract representation of data compression and context management]

In the previous article we dismantled the 18 sections of Claude Code's system prompt. But there's a problem the system prompt doesn't solve: what happens when your conversation grows so large it no longer fits in the context window?

The default window is 200K tokens. With the context-1m beta flag, it can reach one million. But even one million tokens has a limit. And in intense coding sessions — chained error debugging, refactoring large files, documentation queries — it's normal to burn through context fast.

Claude Code's solution isn't a single mechanism. It's a 4-tier progressive compression system — each level more aggressive than the last, activating only when the previous one isn't enough. A defense-in-depth approach where the goal is clear: never lose a user instruction.

Thesis of this article

Claude Code uses a 4-tier progressive context compression system: from clearing old tool results to summarizing the entire conversation into 9 XML sections. Tier 3 contains a "sacred section" of user instructions that is never omitted under any circumstances.

The Problem

Your conversation never ends, but the window does

Think of it like a computer's RAM. You have a finite amount, and the more programs you open, the fuller it gets. The operating system doesn't close your programs — it compresses data, moves things to disk, and frees up space while maintaining the illusion of infinite memory.

Claude Code does exactly the same with your conversation. The analogy is precise:

🖥️ Operating System

  • Registers → hot data, immediate access
  • CPU cache (L1/L2) → frequent data, fast access
  • RAM → active data, normal access
  • Disk → cold data, slow access

🤖 Claude Code

  • Tier 0 → everything in context, no compression
  • Tier 1 → clears old tool results
  • Tier 2 → server clears entire blocks
  • Tier 3 → full 9-section summary

The constant that governs everything is simple: Claude Code measures conversation size in tokens using a fast heuristic — BYTES_PER_TOKEN = 4. It doesn't count real tokens (that would require a tokenizer), but divides bytes by 4. It's imprecise but predictable and cheap.

// Core constants of the compression system
const BYTES_PER_TOKEN = 4;
const MAX_INPUT_TOKENS = 180_000;          // Threshold to activate compression
const TARGET_COMPACT_TOKENS = 40_000;      // Post-compression target
const AUTOCOMPACT_BUFFER_TOKENS = 10_000;  // Safety buffer
const POST_COMPACT_TOKEN_BUDGET = 50_000;  // Re-injection budget
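
To make the heuristic concrete, here is a minimal sketch of what a byte-based estimator could look like. The function name and the use of TextEncoder are illustrative assumptions, not Claude Code's actual implementation:

// Sketch: byte-based token estimation (illustrative, not actual source)
function estimateTokens(text) {
  const bytes = new TextEncoder().encode(text).length;
  return Math.ceil(bytes / BYTES_PER_TOKEN); // BYTES_PER_TOKEN = 4
}

// estimateTokens("hello world") → 3 (11 bytes / 4, rounded up)
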
Overview

The 4 Tiers at a Glance

Tier  Name             Strategy                                              Aggressiveness
0     Normal           No compression. All tool results kept intact.        None
1     Microcompact     Surgical inline clearing of old tool results.        Low
2     API-Native       Server removes entire thinking and tool_use blocks.  Medium
3     Full Compaction  9-section summary. Complete conversation rewrite.    Maximum
Tier 0

Normal Operation — No Compression

In normal operation, Claude Code doesn't touch the context. Every tool result, every thinking block, every user and assistant message — everything stays in the context window as-is.

The default window is 200,000 tokens. With the context-1m beta header, it expands to 1 million. At 4 bytes per token, 200K tokens equals roughly 800KB of text, or about 130,000 words at an average of six characters per word. For short to medium sessions, this is more than enough.

The system monitors context usage on every turn. When usage crosses MAX_INPUT_TOKENS (180K) minus the 10K safety buffer, escalation toward Tier 1 begins.

Tier 1

Microcompact — Surgical Cleanup

The first compression level is the gentlest way to free up space. It works by replacing old tool results in-place — the original message stays in the conversation, but its content is substituted with a marker.

// The replacement text is always the same:
"[Old tool result content cleared]"

// Compactable tools (have reproducible results):
const COMPACTABLE_TOOLS = [
  'FileRead',      // Can re-read the file
  'Bash',          // Can re-run the command
  'Grep',          // Can re-search
  'Glob',          // Can re-list
  'WebSearch',     // Can re-search the web
  'WebFetch',      // Can re-download
  'FileEdit',      // The diff was already applied
  'FileWrite',     // The file was already written
];

The selection logic is time-based: oldest results are cleared first, preserving the N most recent ones. This is based on a reasonable premise: if you read a file 50 turns ago, you probably don't need to see its literal content anymore. If you do, Claude can read it again.

💡 Why not clear everything? Only tools with reproducible results are cleared. If you can re-read a file or re-run a grep, losing the old result has minimal cost. Results from tools with side effects (like a git commit) are never cleared.

Microcompact is elegant because it doesn't alter the conversation structure. The model still sees that it used a tool and in what order — it just loses the result details. It's like clearing the output of a command in your terminal but keeping the history of which commands you ran.
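
As a rough sketch of how this in-place clearing could be implemented, assuming messages are plain objects with type, toolName, and content fields (the shapes and the keepRecent parameter are my assumptions, not Claude Code's source):

// Sketch: Tier 1 microcompact (assumed message shapes, illustrative only)
function microcompact(messages, keepRecent = 3) {
  const compactable = messages.filter(
    (m) => m.type === 'tool_result' && COMPACTABLE_TOOLS.includes(m.toolName)
  );
  // Oldest first: clear everything except the keepRecent most recent results
  const toClear = compactable.slice(0, Math.max(0, compactable.length - keepRecent));
  for (const msg of toClear) {
    msg.content = '[Old tool result content cleared]'; // structure stays intact
  }
  return messages;
}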

Tier 2

API-Native — The Server Decides

When Tier 1 doesn't free up enough space, Claude Code activates server-side context management. This uses Anthropic's API beta header context-management-2025-06-27.

Unlike Tier 1, which operates client-side by manipulating messages, Tier 2 tells Anthropic's server: "remove these entire blocks from the conversation". There are two available strategies:

🧹

clear_tool_uses

Completely removes old tool_use and tool_result blocks from the conversation. It doesn't mark them as cleared — it deletes them. The model no longer knows it used those tools.

🧠

clear_thinking

Removes thinking blocks (extended reasoning) from older turns. The model's thinking tends to be verbose — clearing it can free thousands of tokens without losing the actual results.

// Tier 2 thresholds
if (inputTokens > MAX_INPUT_TOKENS) {        // 180,000
  // Activate server-side context management
  headers['anthropic-beta'] = 'context-management-2025-06-27';
  
  // Target: reduce to TARGET_COMPACT_TOKENS
  targetTokens = TARGET_COMPACT_TOKENS;      // 40,000
}

The key difference from Tier 1 is that Tier 2 deletes entire blocks, not just contents. It's more aggressive, but still deterministic — there's no interpretation or summarization, just selective removal of sections.
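
To make this concrete, here is a sketch of what opting into the beta could look like. The header is the one named above; the context_management body and the edit names follow this article's strategy names, so treat the exact schema as an assumption and verify it against Anthropic's current documentation:

// Sketch: requesting server-side clearing (body shape is an assumption)
const response = await fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  headers: {
    'x-api-key': process.env.ANTHROPIC_API_KEY,
    'anthropic-version': '2023-06-01',
    'anthropic-beta': 'context-management-2025-06-27',
    'content-type': 'application/json',
  },
  body: JSON.stringify({
    model: 'claude-sonnet-4-5', // illustrative model name
    max_tokens: 4096,
    messages,
    context_management: {
      edits: [
        { type: 'clear_tool_uses' },  // remove old tool_use/tool_result blocks
        { type: 'clear_thinking' },   // remove old thinking blocks
      ],
    },
  }),
});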

Tier 3

Full Compaction — The 9-Section Summary

When neither microcompact nor API-native management is sufficient, Claude Code deploys its ultimate weapon: a dedicated agent that summarizes the entire conversation into a structured XML format with 9 sections.

This isn't a free-form summary. It's a strict template that the compaction agent must follow. The output has two parts: an <analysis> block (internal reasoning) and a <summary> block (the final summary). Only the <summary> block is injected as the new conversation.

The 9 mandatory sections are:

1
Primary Request & Goal

The user's primary objective. What they're trying to accomplish in the conversation.

2
Key Technical Concepts

Technologies, frameworks, APIs, and technical concepts discussed.

3
Files & Code Modified

List of files touched, what changes were made, current state.

4
Errors & Fixes

Errors encountered and how they were resolved. Critical to avoid repeating failed attempts.

5
Problem Solving Approach

Strategies attempted, design reasoning, technical decisions made.

6
All User Messages (SACRED)

MUST NOT OMIT ANY user messages. All user instructions, questions, and clarifications are preserved in full. This is the section that guarantees no user instruction is ever lost, regardless of how much the conversation is compressed.

7
Pending Tasks

Tasks left to do. Prevents compaction from causing amnesia about work-in-progress.

8
Current Work

What exactly was being worked on when compaction was triggered.

9
Optional Next Step

Suggestion for the next logical step to continue the work.

🛡️ Section 6 is "sacred." The compaction agent's prompt explicitly states: "MUST NOT OMIT ANY user messages". Everything is formatted as verbatim quotes. It's the system's fundamental guarantee: no matter how much the conversation is compressed, your instructions are preserved in full.

// XML format of the compaction summary
<analysis>
  ... internal reasoning of the compaction agent ...
</analysis>

<summary>
  <section title="Primary Request & Goal">...</section>
  <section title="Key Technical Concepts">...</section>
  <section title="Files & Code Modified">...</section>
  <section title="Errors & Fixes">...</section>
  <section title="Problem Solving Approach">...</section>
  <section title="All User Messages">
    <!-- ALL user messages go here, none omitted -->
    User: "Refactor the auth module to use JWT"
    User: "Oh, and make sure the refresh token expires in 7 days"
    User: "Rename the field from 'token' to 'accessToken'"
    ...
  </section>
  <section title="Pending Tasks">...</section>
  <section title="Current Work">...</section>
  <section title="Optional Next Step">...</section>
</summary>
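
Because only the <summary> block becomes the new conversation, the client has to discard the <analysis> block after the compaction agent responds. A minimal sketch of that extraction step (the function and the regex approach are mine, not necessarily how Claude Code does it):

// Sketch: keep <summary>, drop <analysis> (illustrative extraction)
function extractSummary(agentOutput) {
  const match = agentOutput.match(/<summary>([\s\S]*?)<\/summary>/);
  if (!match) throw new Error('Compaction output missing <summary> block');
  return match[1].trim(); // this text becomes the new conversation
}
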
Recovery

Post-Compaction Recovery

After running Tier 3, the conversation has been dramatically reduced — from potentially 200K+ tokens to a structured summary. But a summary alone isn't enough for the model to effectively resume work. Claude Code runs a recovery process that re-injects critical context.

// Post-compaction recovery budget
const POST_COMPACT_TOKEN_BUDGET = 50_000;

// 1. Restore the 5 most recently used files
const RECENT_FILES_COUNT = 5;
const MAX_TOKENS_PER_FILE = 5_000;
// Total: up to 25,000 tokens in files

// 2. Restore active skills
const SKILL_RESTORE_BUDGET = 25_000;
// Skills that were in use when compaction occurred

The process follows two steps, sketched in code after the list:

1. Recent file restoration

The 5 files the model was most recently reading or editing are re-injected into context, each with a maximum of 5,000 tokens. This gives the model an immediate "glimpse" at the files it was working on, without needing to re-read them from scratch.

2. Active skills restoration

If the conversation had loaded skills (documentation, special instructions, MCP tool context), they're re-injected up to a budget of 25,000 tokens. This ensures the model doesn't lose specialized knowledge acquired during the session.
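
Put together, the recovery step might look like the following sketch. The helpers readFileTruncated and restoreSkill are hypothetical; only the budgets come from the constants above:

// Sketch: post-compaction recovery (hypothetical helpers, real budgets)
async function recoverContext(recentFiles, activeSkills) {
  const restored = [];

  // 1. Re-inject the 5 most recent files, capped at 5,000 tokens each
  for (const path of recentFiles.slice(0, RECENT_FILES_COUNT)) {
    restored.push(await readFileTruncated(path, MAX_TOKENS_PER_FILE));
  }

  // 2. Re-inject active skills within the 25,000-token skill budget
  let skillBudget = SKILL_RESTORE_BUDGET;
  for (const skill of activeSkills) {
    if (skill.tokens > skillBudget) break;
    restored.push(restoreSkill(skill));
    skillBudget -= skill.tokens;
  }

  return restored;
}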

This recovery step is what differentiates Claude Code's compaction from simply truncating the conversation. You don't lose context: you compress it, then restore the most important parts. It's like waking up from a nap with a sticky note of the 5 most important things on your nightstand.
Safety

Circuit Breaker — When Everything Fails

What happens if compaction fails? What if the summary agent produces a summary that's still too large? What if the system falls into a loop: compact, grow, compact, grow?

Claude Code has a circuit breaker for exactly this scenario:

// Maximum consecutive compactions without progress
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

// If 3 consecutive failures are reached:
// → Auto-compaction is stopped
// → The user is notified
// → /clear is suggested for manual reset

// Override for special cases:
env.CLAUDE_AUTOCOMPACT_PCT_OVERRIDE = "80";
// Forces compaction when context reaches 80%

The circuit breaker prevents infinite compaction loops. If the system tries to compact 3 times in a row and still can't reduce context enough, it stops trying. This can happen in conversations with extremely dense content where even the summary is large.

For special situations, the environment variable CLAUDE_AUTOCOMPACT_PCT_OVERRIDE exists. Setting a value like "80" forces compaction to activate when context reaches 80% of the maximum, instead of waiting for the default threshold.
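
The pattern is straightforward to replicate in your own agent. A minimal sketch of the failure counter follows; only the constant and the 3-strike rule come from Claude Code, while the progress test and the rest are illustrative assumptions:

// Sketch: 3-strike circuit breaker (progress test is an assumption)
const notifyUser = (msg) => console.warn(msg);
let consecutiveFailures = 0;

function onCompactionResult(tokensBefore, tokensAfter) {
  const madeProgress = tokensAfter < tokensBefore * 0.9; // assumed threshold
  consecutiveFailures = madeProgress ? 0 : consecutiveFailures + 1;

  if (consecutiveFailures >= MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES) {
    notifyUser('Auto-compaction stopped. Use /clear to reset manually.');
    return 'STOP';
  }
  return 'CONTINUE';
}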

Flow

Decision Tree — When Each Tier Activates

Each tier's activation isn't random — it's a deterministic decision tree based on token count:

// Pseudocode of the decision tree
function decideCompressionTier(conversation, consecutiveFailures) {
  const inputTokens = estimateTokens(conversation);

  // Are we within budget? (180K minus the 10K safety buffer)
  if (inputTokens < MAX_INPUT_TOKENS - AUTOCOMPACT_BUFFER_TOKENS) {
    return TIER_0; // No compression
  }

  // Are there tool results to clear?
  if (hasCompactableToolResults(conversation)) {
    clearOldToolResults(conversation); // Tier 1
    if (estimateTokens(conversation) < MAX_INPUT_TOKENS) {
      return TIER_1; // Microcompact was enough
    }
  }

  // Can the server clear blocks?
  if (betaHeaderSupported('context-management')) {
    activateServerSideClearing(); // Tier 2
    if (estimateTokens(conversation) < MAX_INPUT_TOKENS) {
      return TIER_2; // API-native was enough
    }
  }

  // Last resort: full compaction
  if (consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES) {
    return TIER_3; // 9-section summary
  }

  // Circuit breaker activated
  notifyUser("Context too dense. Use /clear to reset.");
  return CIRCUIT_BREAKER;
}
Application

Patterns You Can Apply Today

Claude Code's compression system isn't just an internal engineering exercise. There are concrete patterns you can apply to your own agents or LLM workflows:

1. Multi-layer defense (not a single strategy)

Don't build a single compression mechanism. Use progressive levels — first the gentlest (clear reproducible results), then medium (delete entire blocks), and only at the end the most aggressive (summarize everything). Each level has a different cost and a different information loss.
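
One way to structure this in your own agent is an ordered list of strategies, tried gentlest-first until the conversation is back under budget. A generic sketch, with hypothetical strategy functions:

// Sketch: progressive compression, gentlest strategy first (generic pattern)
const strategies = [
  clearReproducibleResults, // cheap, minimal information loss
  dropOldBlocks,            // medium: deletes structure
  summarizeConversation,    // expensive, maximum information loss
];

function compress(conversation, budget) {
  for (const strategy of strategies) {
    if (estimateTokens(conversation) <= budget) break;
    conversation = strategy(conversation); // escalate only when needed
  }
  return conversation;
}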

2. Never lose user instructions

If you summarize a conversation to save context, treat user messages as sacred data. Technical context can be reconstructed (re-read files, re-run commands), but user instructions are unique and irrecoverable. Section 6 of Claude Code's summary exists for this reason.
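
When you build your own summarizer, the rule can be as simple as splitting the transcript into a verbatim part and a compressible part. A generic sketch, not Claude Code's source:

// Sketch: user turns pass through verbatim; only the rest is summarized
function buildSummaryInput(messages, summarizeTurns) {
  const sacred = messages
    .filter((m) => m.role === 'user')
    .map((m) => `User: "${m.content}"`) // verbatim, never omitted
    .join('\n');

  const compressible = summarizeTurns(
    messages.filter((m) => m.role !== 'user')
  );

  return { sacred, compressible };
}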

3. Active post-recovery

After compressing, don't wait for the model to ask for what it needs. Proactively re-inject the most recent files and skill context. It's the difference between waking up with no memory and waking up with a summary and your 5 most important files already open.

4. Circuit breakers in every auto-compression system

If your agent auto-compresses, set a maximum for consecutive attempts. Without a circuit breaker, you can end up in a loop where the agent spends tokens compressing, the summary is nearly as large, and it compresses again. After 3 consecutive failures, stop and ask for human intervention.

Takeaway

The Illusion of Infinite Memory

Claude Code's context compression system is a virtual memory system for LLMs. Just as your operating system gives you the illusion of infinite RAM by combining cache, RAM, and swap — Claude Code gives you the illusion of an infinite conversation by combining 4 levels of progressive compression.

The key to the design is progressiveness: each level is more aggressive but only activates when the previous one fails. And the invariant that runs through the entire system is that user instructions are never lost — section 6 of the summary is sacred.

In the next article in this series, we'll explore persistent memory systems — how Claude Code remembers things between sessions. Auto-extract, session memory, magic docs, and auto-dream: four mechanisms that work while you sleep so your next session picks up where the last one left off.

📚 Claude Code Anatomy Series: This is article 2 of 6. The previous one covers the 18-layer system prompt. The following articles cover memory systems, hidden features, the tool & permission system, and the dual build (internal vs public).

Building agents with LLMs?

At Cadences we design agents using the same industrial patterns we discovered in Claude Code. If you want to apply these principles to your business, let's talk.


Gonzalo Monzón

Founder of CadencesLab. Software engineer, multi-agent systems architect, and perpetual student of how machines think. This series was born from months of reverse engineering the tools we use daily.
