[GH-ISSUE #14872] 🚀 Architecture Proposal: Swarm Memory (Zero-Copy KV-Cache Sharing Across Heterogeneous Models) #56102

Closed
opened 2026-04-29 10:16:17 -05:00 by GiteaMirror · 0 comments

Originally created by @allornothingai on GitHub (Mar 16, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14872

# 🚀 Architecture Proposal: Swarm Memory (Zero-Copy KV-Cache Sharing Across Heterogeneous Models)

Hi @jmorganca and the Ollama team. Ollama has won the local inference war, but as we move from single-chat interfaces to autonomous multi-agent swarms (running Qwen for code, LLaVA for vision, and Mistral for routing concurrently), we are hitting a hard VRAM wall.

The core bottleneck isn't the model weights; it's the **redundant KV-Cache**.

I propose a fundamental architectural shift for the Ollama engine: **Swarm Memory**.

### The Problem: Redundant Context Processing

Right now, if I have three agents analyzing the same 20,000-token codebase, I have to pass that 20k-token prompt to three separate models. Ollama computes the KV-cache three separate times, occupying 3x the VRAM and burning 3x the compute, even though the prefix context is mathematically identical.
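To put rough numbers on that, here is a back-of-envelope sizing sketch. All architecture figures (32 layers, 8 KV heads, head dim 128, fp16) are assumptions for a Llama-3-8B-class model, not values measured from Ollama:

```go
package main

import "fmt"

// Rough KV-cache sizing for one model instance. The architecture
// numbers below are assumptions (Llama-3-8B-class), not measurements.
const (
	nLayers   = 32  // transformer layers
	nKVHeads  = 8   // grouped-query attention KV heads
	headDim   = 128 // per-head dimension
	elemBytes = 2   // fp16
)

// kvCacheBytes returns the cache footprint for a given prefix length.
// The factor of 2 accounts for the K and V tensors in each layer.
func kvCacheBytes(tokens int) int {
	return 2 * nLayers * tokens * nKVHeads * headDim * elemBytes
}

func main() {
	perCopy := kvCacheBytes(20_000) // the shared 20k-token prefix
	fmt.Printf("one copy:     %.2f GiB\n", float64(perCopy)/(1<<30))
	fmt.Printf("three copies: %.2f GiB\n", float64(3*perCopy)/(1<<30))
}
```

Under these assumptions the 20k-token prefix costs roughly 2.4 GiB per copy, so three agents burn over 7 GiB of VRAM holding identical data.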

### The Vision: A Unified, Zero-Copy Memory Bus

Ollama should not just be a model runner; it should act as a central **Memory Server**.

When a user defines a system prompt or a large context window, Ollama computes the KV-cache once and allocates it to a shared memory region in Metal/CUDA. When subsequent models are called with the same prefix context, Ollama detects the exact hash match of the tokens.

Instead of recomputing the cache, the inference engine simply passes a **zero-copy pointer** to the shared memory region.

```go
// Proposed Architecture Flow:
// 1. User loads large codebase into context.
// 2. Ollama routes to SwarmMemoryManager.
// 3. Cache generated -> Shared VRAM allocation (Address: 0xFA88...)
// 4. Agent A (Qwen) starts generation -> Reads prefix from 0xFA88...
// 5. Agent B (Llama3) starts generation -> Reads prefix from 0xFA88...
```
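For concreteness, here is a minimal sketch of the lookup side of that flow, in Go since that is Ollama's implementation language. Everything here (`CacheHandle`, the keying scheme, the `Lookup`/`Publish` API) is hypothetical. One important caveat the design must address: KV tensors are only reusable by models with an identical tokenizer and cache layout, so the key has to include a model-compatibility fingerprint alongside the token hash:

```go
package swarm

import (
	"crypto/sha256"
	"encoding/binary"
	"sync"
)

// CacheHandle is a hypothetical zero-copy reference to a KV-cache
// region already resident in GPU memory (a Metal/CUDA allocation).
type CacheHandle struct {
	DevicePtr uintptr // e.g. the 0xFA88... address from the flow above
	Tokens    int     // prefix length covered by this cache
}

// SwarmMemoryManager maps (model fingerprint, token prefix) -> handle.
// KV tensors are only valid for models sharing a tokenizer and cache
// layout, so the model fingerprint must be part of the key.
type SwarmMemoryManager struct {
	mu    sync.RWMutex
	cache map[[32]byte]CacheHandle
}

// prefixKey hashes the model fingerprint and the exact token prefix.
func prefixKey(modelFingerprint string, tokens []int32) [32]byte {
	h := sha256.New()
	h.Write([]byte(modelFingerprint))
	var b [4]byte
	for _, t := range tokens {
		binary.LittleEndian.PutUint32(b[:], uint32(t))
		h.Write(b[:])
	}
	var key [32]byte
	copy(key[:], h.Sum(nil))
	return key
}

// Lookup returns an existing shared cache for this exact prefix, if any.
// On a hit, the caller reads from the shared region instead of re-prefilling.
func (m *SwarmMemoryManager) Lookup(model string, tokens []int32) (CacheHandle, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	h, ok := m.cache[prefixKey(model, tokens)]
	return h, ok
}

// Publish registers a freshly computed prefix cache for reuse.
func (m *SwarmMemoryManager) Publish(model string, tokens []int32, h CacheHandle) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.cache == nil {
		m.cache = make(map[[32]byte]CacheHandle)
	}
	m.cache[prefixKey(model, tokens)] = h
}
```

In the flow above, step 3 would call `Publish` after the first prefill, and steps 4-5 would hit `Lookup` before prefilling.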

### Why this changes everything:

If implemented, a 64GB Mac Studio could run 10-15 autonomous agents concurrently operating on the *exact same* massive context window without ever crashing, because the context is only stored in VRAM once.
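Extending the earlier sizing assumptions (~2.44 GiB per 20k-token prefix copy), a quick comparison of duplicated vs. shared prefix memory; note that each agent's own generated tokens still accrue per-agent KV on top of this:

```go
package main

import "fmt"

func main() {
	// Assumed cost of one 20k-token prefix copy (see sizing sketch above).
	const prefixGiB = 2.44
	for _, n := range []int{3, 10, 15} {
		fmt.Printf("%2d agents: %5.1f GiB duplicated vs %.2f GiB shared\n",
			n, float64(n)*prefixGiB, prefixGiB)
	}
}
```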

It fundamentally shifts Ollama from a "local ChatGPT" to an "Enterprise Swarm OS."

### Next Steps

If the core team is interested in tackling this memory optimization, my engineering team (`allornothingai`) has extensive experience in C++/Metal shared memory architectures. We are prepared to write the memory routing layer and submit a PR.

Let's build the ultimate swarm architecture.
