
🧠 Headroom - Context Optimization Layer

Reduce LLM API costs by 50-90% through intelligent context compression. Tool outputs are 70-95% redundant boilerplate—Headroom compresses that away while preserving accuracy.

📋 Overview

This app demonstrates how to use Headroom to dramatically reduce token usage when working with AI agents and tool-heavy LLM applications. Unlike simple truncation, Headroom uses statistical analysis to keep what matters and compress what doesn't.

Key Benefits

  • 💰 47-92% token reduction verified across real workloads
  • 🎯 Zero code changes - works as a transparent proxy
  • 🔄 Reversible compression - LLM can retrieve original data via CCR
  • 🧠 Content-aware - handles code, logs, JSON optimally
  • ⚡ Provider caching - automatic prefix optimization for cache hits
  • 🔌 Framework native - LangChain, Agno, MCP, any OpenAI client

🚀 Features

  • SmartCrusher: Statistical compression of JSON tool outputs—keeps first items, last items, anomalies, and query-relevant matches (see the sketch after this list)
  • CacheAligner: Stabilizes prefixes for better provider-side caching (OpenAI, Anthropic, Google)
  • RollingWindow: Manages context limits without breaking tool call/response pairing
  • Code-Aware Compression: AST-based compression using tree-sitter
  • LLMLingua-2 Integration: Optional ML-based 20x compression
  • Memory System: Persistent, LLM-driven memory with zero-latency inline extraction
  • CCR (Compress-Cache-Retrieve): Reversible compression—LLM requests original data when needed
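The sketch below illustrates the selection idea behind SmartCrusher: from a large JSON array, keep the head, the tail, statistical outliers, and query matches, and summarize the rest. It is a simplified illustration, not Headroom's actual implementation; the crush function and the duration_ms field are invented for the example.

import json
import statistics

def crush(items: list[dict], query: str, head: int = 3, tail: int = 3):
    """Keep head/tail items, numeric outliers, and query matches; drop the rest."""
    keep = set(range(min(head, len(items))))
    keep |= set(range(max(0, len(items) - tail), len(items)))
    # Flag anomalies: items whose numeric field strays far from the mean.
    scores = [item.get("duration_ms", 0) for item in items]
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores) or 1.0
    keep |= {i for i, s in enumerate(scores) if abs(s - mean) > 3 * stdev}
    # Keep anything that mentions the user's query.
    keep |= {i for i, item in enumerate(items) if query.lower() in json.dumps(item).lower()}
    kept = [items[i] for i in sorted(keep)]
    return kept, f"[{len(items) - len(kept)} similar items omitted]"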

📦 Installation

Basic Installation

pip install headroom-ai

With Framework Integrations

pip install "headroom-ai[proxy]"      # Proxy server (zero code changes)
pip install "headroom-ai[langchain]"  # LangChain integration
pip install "headroom-ai[agno]"       # Agno agent framework
pip install "headroom-ai[code]"       # AST-based code compression
pip install "headroom-ai[llmlingua]"  # ML-based compression
pip install "headroom-ai[all]"        # Everything

💻 Usage

Option 1: Proxy (Zero Code Changes)

headroom proxy --port 8787

Point your existing tools at the proxy:

# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Cursor or any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
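
The same proxy works from any OpenAI-compatible SDK. For example, with the official openai Python client (assuming the default /v1 route shown above):

from openai import OpenAI

# Route an existing OpenAI client through the local Headroom proxy.
# OPENAI_API_KEY is read from the environment as usual.
client = OpenAI(base_url="http://localhost:8787/v1")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize these logs: ..."}],
)
print(response.choices[0].message.content)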

Option 2: LangChain Integration

from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

# Wrap your model - that's it!
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
response = llm.invoke("Analyze these logs and find the error")

Option 3: Agno Agent Framework

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

# Wrap your model
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
# search_github, search_code, query_db are your own tool functions
agent = Agent(model=model, tools=[search_github, search_code, query_db])

response = agent.run("Investigate the memory leak")
print(f"Tokens saved: {model.total_tokens_saved}")

📊 Real-World Performance

These numbers are from actual API calls, not estimates:

| Scenario                   | Before        | After         | Savings |
|----------------------------|---------------|---------------|---------|
| Code search (100 results)  | 17,765 tokens | 1,408 tokens  | 92%     |
| SRE incident debugging     | 65,694 tokens | 5,118 tokens  | 92%     |
| Codebase exploration       | 78,502 tokens | 41,254 tokens | 47%     |
| GitHub issue triage        | 54,174 tokens | 14,761 tokens | 73%     |
| Multi-tool agent           | 15,662 tokens | 6,100 tokens  | 76%     |

🔬 Proof: Needle in Haystack Test

Setup: 100 production log entries. One critical FATAL error buried at position 67.

Before Headroom: 10,144 tokens
After Headroom: 1,260 tokens (87.6% reduction)

The question: "What caused the outage? What's the error code? What's the fix?"

Both responses (baseline and Headroom): "payment-gateway service, error PG-5523, fix: Increase max_connections to 500, 1,847 transactions affected"

Same answer. 87.6% fewer tokens.

# Run it yourself
python headroom_demo.py

🎯 Best Use Cases

Use Headroom when:

  • Building AI agents with multiple tools (search, database, APIs)
  • Processing large tool outputs (logs, code search results, API responses)
  • Context window is filling up with redundant data
  • Reducing LLM API costs at scale

Headroom shines with:

  • 🔍 Code search results
  • 📋 Log analysis
  • 🗄️ Database query results
  • 🔗 API response processing
  • 🤖 Multi-tool agent workflows

🛡️ Safety Guarantees

  • Never removes human content - user/assistant messages preserved
  • Never breaks tool ordering - tool calls and responses stay paired
  • Parse failures are no-ops - malformed content passes through unchanged
  • Compression is reversible - LLM retrieves original via CCR (see the sketch below)
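
To make the reversibility guarantee concrete, here is a minimal sketch of the compress-cache-retrieve pattern. It is illustrative only—Headroom's real CCR API may differ—but it captures the contract: every compressed span carries a key, and the original bytes stay retrievable.

import hashlib

_cache: dict[str, str] = {}

def compress_with_receipt(original: str, summary: str) -> str:
    # Stash the original verbatim and tag the summary with a retrieval key.
    key = hashlib.sha256(original.encode()).hexdigest()[:12]
    _cache[key] = original
    return f"{summary}\n[compressed; retrieve original with key {key}]"

def retrieve(key: str) -> str:
    # Called when the LLM decides it needs the uncompressed data back.
    return _cache[key]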

🔌 Provider Support

| Provider  | Token Counting      | Cache Optimization       |
|-----------|---------------------|--------------------------|
| OpenAI    | tiktoken (exact)    | Automatic prefix caching |
| Anthropic | Official API        | cache_control blocks     |
| Google    | Official API        | Context caching          |
| Cohere    | Official API        | -                        |
| Mistral   | Official tokenizer  | -                        |
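
What "cache optimization" means in practice, sketched below as a general technique under the assumption that providers key their caches on an exact token prefix: keep the shared prefix byte-identical across requests and push anything volatile toward the end. The names here are illustrative, not Headroom's code.

# Sketch of prefix stabilization (assumed provider behavior).
# Per-request values (timestamps, fresh tool output) must stay out of the prefix.

STABLE_SYSTEM = "You are a code-search agent. Tools: search, read_file."  # never changes

def build_messages(volatile_context: str, user_msg: str) -> list[dict]:
    return [
        {"role": "system", "content": STABLE_SYSTEM},  # identical every call -> cache hit
        {"role": "user", "content": f"{volatile_context}\n\n{user_msg}"},  # volatile data last
    ]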

🤝 Contributing

Contributions are welcome! Feel free to:

  • Report bugs
  • Suggest new compression strategies
  • Add benchmarks
  • Improve documentation

📄 License

Apache License 2.0 - see LICENSE.

🙏 Credits

Built by Tejas Chopra for the AI developer community.