[GH-ISSUE #9890] Large context size completely breaks the usability of the model #32236

Closed
opened 2026-04-22 13:18:39 -05:00 by GiteaMirror · 27 comments
Owner

Originally created by @AlbertoSinigaglia on GitHub (Mar 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9890

What is the issue?

Just installed [Gemma3](hf.co/unsloth/gemma-3-27b-it-GGUF:Q8_0), which advertises a context length of 131072, and just learned that [this means nothing to Ollama](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size), since it still runs with a 2048-token context if not specified otherwise.

So, if I run it with the default context, it runs smoothly, loads the model correctly onto a single GPU, and outputs what it's supposed to in a matter of seconds.

Image

and has no problem at all at answering.

Image

However, as soon as I run `/set parameter num_ctx 128000`, it shards the model across GPUs and never answers again.

Image Image

Context:
I'm running it on a server with 3x A6000 GPUs, using the following config in `systemctl edit ollama`:

Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=3"

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.6.0

GiteaMirror added the bug label 2026-04-22 13:18:39 -05:00

@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):

To be fair, the model does answer after about 5 minutes, but it runs at 1 t/s, where initially it was around 10 t/s... I would understand it if the context were actually filled to 128K, but since both versions work with almost no context in the input, it feels weird that a longer context setting causes such a slowdown.


@rick-github commented on GitHub (Mar 19, 2025):

Environment="OLLAMA_NUM_PARALLEL=3"

This is tripling the size of the context that ollama allocates on the GPU. This is causing a bunch of layers to be loaded in to system RAM, and inference is much slower. Server logs will show how much VRAM the increased context size is consuming.


@rick-github commented on GitHub (Mar 19, 2025):

https://github.com/ollama/ollama/issues/9791#issuecomment-2727576383


@ALLMI78 commented on GitHub (Mar 19, 2025):

"the model after like 5 minutes answers, yet it works at 1t/s"

That's exactly what I was seeing in my setup with a 32k context... I feel like there are still issues with larger context sizes, but since I can't break down my entire workflow into a reproducible example, I'm having trouble presenting the problem. Maybe you can... I also wasn't sure if my GPU (4060 Ti 16GB) is simply too weak for Gemma 3.


@rick-github commented on GitHub (Mar 19, 2025):

More context == less room for model weights in VRAM. Less room for model weights in VRAM == model weights in system RAM. Model weights in system RAM == slower inference.


@ALLMI78 commented on GitHub (Mar 19, 2025):

Dear Rick,

You're clearly the pro here, and I don’t mean to challenge you, but if someone with 3x A6000 GPUs can’t get Gemma 3 running properly, is that really expected behavior?

Your explanation makes sense and is technically correct, but again—I can run 14B models on my GPU with a 32k context without any issues, yet I can't get Gemma 3 (12B) to work, experiencing exactly the symptoms described here.

I remember reading somewhere that Gemma 3 requires additional memory for image-related tasks, but I can’t find that source anymore. That could be an explanation…?

@AlbertoSinigaglia try 0.6.2


@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):

@rick-github, thanks for the clarification; I switched to `Environment="OLLAMA_NUM_PARALLEL=1"` to see if it makes a difference. I'm uploading a log file with everything saved (for reference, the server has 512 GB of RAM and 64 GB of swap).

Unfortunately, the same happens:

Image

To be fair, the 5-minute wait time is only for the first generation after setting the context size to 128k, though the 1t/s speed still remains.

@ALLMI78 I'm actually running the 27B, yet the behavior is the same using Phi4 with 128k context (which is a 14B model)... I'll try with the newest ollama version

[logs.txt](https://github.com/user-attachments/files/19342143/logs.txt)


@ALLMI78 commented on GitHub (Mar 19, 2025):

Phi4 with 128k? Are you sure it supports that? I've only seen 16k versions until now...

To compare, you can also try qwen-14b; it runs fine for me...

https://ollama.com/library/qwen2.5 (supports up to 128K tokens and has multilingual support)


@rick-github commented on GitHub (Mar 19, 2025):

> I can run 14B models on my GPU with a 32k context without

A token has a different size for different models, so a 32k context for phi4:14b is 6.2G and for gemma3:12b is 12G. If a model needs to be sharded across several devices, the amount of VRAM required goes up because a copy of the graph needs to run on each device. gemma3 has additional memory requirements for the image projector, so gemma3 is going to consume more VRAM, pushing model layers into system RAM where inference is slower.

```
mar 19 14:53:35 acquario3 ollama[2433716]: time=2025-03-19T14:53:35.193+01:00 level=INFO source=server.go:138
 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=12 layers.split=3,5,4
 memory.available="[41.3 GiB 47.0 GiB 46.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="297.6 GiB"
 memory.required.partial="129.3 GiB" memory.required.kv="181.6 GiB"
 memory.required.allocations="[40.3 GiB 46.2 GiB 42.8 GiB]" memory.weights.total="207.0 GiB"
 memory.weights.repeating="205.6 GiB" memory.weights.nonrepeating="1.4 GiB"
 memory.graph.full="25.7 GiB" memory.graph.partial="25.7 GiB" projector.weights="818.0 MiB" projector.graph="0 B"

mar 19 14:53:35 acquario3 ollama[2433716]: time=2025-03-19T14:53:35.254+01:00 level=INFO source=server.go:405
 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine
 --model /usr/share/ollama/.ollama/models/blobs/sha256-bdf2532a7fc6115108e3ba005cfe0acb8bb8b0f61f2c2afb59d9ee5d15dc4f47
 --ctx-size 384000 --batch-size 512 --n-gpu-layers 12 --verbose --threads 64 --flash-attn --parallel 3
 --tensor-split 3,5,4 --port 37557"
```

In this log, OLLAMA_NUM_PARALLEL is 3. This increases the context buffer that ollama allocates to 384000 tokens. That needs 182G. As a result, ollama can only load 12 of the 63 layers of the model in VRAM, the rest being loaded in system RAM where the slower CPU does inference. If OLLAMA_NUM_PARALLEL=1, context cache will fall to approximately 60G, leaving 120G for loading extra model layers in VRAM.
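
As a rough cross-check of those figures, the KV cache scales linearly with the allocated context. A back-of-the-envelope sketch (the gemma3:27b hyperparameters below are assumed approximations, not values taken from the log):

```python
# Back-of-the-envelope KV-cache estimate. Assumed approximations for gemma3:27b:
# 62 transformer layers, 16 KV heads, head_dim 128, fp16 cache. Illustrative only.
def kv_cache_bytes(ctx_tokens, n_layers=62, n_kv_heads=16, head_dim=128, bytes_per_elem=2):
    # One K and one V vector of n_kv_heads * head_dim per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_tokens

print(kv_cache_bytes(128_000 * 3) / 2**30)  # OLLAMA_NUM_PARALLEL=3 -> ~182 GiB, matching memory.required.kv
print(kv_cache_bytes(128_000 * 1) / 2**30)  # OLLAMA_NUM_PARALLEL=1 -> ~60 GiB
```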

> To be fair, the 5-minute wait time is only for the first generation after setting the context size to 128k, though the 1t/s speed still remains.

Changing the context size causes a model reload, that's why the first inference after changing num_ctx takes a while.


@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):

@ALLMI78 Sure, you are right, though I was not looking for a nice generation, only compute time requirements, and I had only that one as a "small" model.

@rick-github, you're right; I've mistakenly changed num models from 3 to 1 instead of num parallel (oops...)

Now the loading is pretty fast, though it allocates 20 GB on each GPU... which aligns with your 60 GB of predicted memory usage. However, it's still not quite clear to me why ollama should preallocate the whole 128k context instead of growing it dynamically based on the actual context (so from 0 to N over time)... Does it have anything to do with KV caching?


@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):

[OT] @ALLMI78 https://huggingface.co/microsoft/Phi-4-multimodal-instruct WELL, new version just dropped eheh


@ALLMI78 commented on GitHub (Mar 19, 2025):

@rick-github Dear Rick,

Thank you for the detailed explanations. Yes, you had already explained the OLLAMA_NUM_PARALLEL = 1/3, and I understood that.

I had already noticed that different models use different tokenizers. For example, I saw that Llama models require significantly fewer tokens than Qwen models (about 30-40% less). I understood all your explanations and think everything is absolutely correct, except for one point…

With a Gemma 3 12B Q4_K_M and 32K context, I am seeing 24 GB memory usage. The model itself is around 8 GB, so that would mean another 16 GB for the 32K context, including the image projector?

For Qwen 14B Q4_K_M with 8GB size, I get 16-17 GB memory usage with 32K context, so 8 for the model and 8 for the 32k context.

Because they told us "The current, most capable model that runs on a single GPU.", and since there were initial RAM issues with Gemma 3, I remained a bit skeptical. I'm not able to run the 12b (or only at 1 t/s), and the 4b also doesn't work for me; it answers crap.

I trust you—if you say everything fits and is normal, I accept that. I just noticed that the symptoms here were exactly the same as for me before: too much memory usage → swapping to CPU → slow speed (~1 token/s), exactly like in my case.

@AlbertoSinigaglia I saw Phi-4 Multimodal, but there's no GGUF yet, and I had issues converting it myself (https://huggingface.co/models?sort=trending&search=phi-4+multimodal+gguf). Does everything work normally for you now? Inference speed all good?


@ALLMI78 commented on GitHub (Mar 19, 2025):

### OT -> preallocate

I think it won't work without preallocation, or at least it is hard to get a stable solution. You can't change the context size at runtime.

Imagine you start with 2K context and you later want to increase it, but in the meantime, the user has loaded another software that blocks VRAM, or some process is using it.

If you now need to dynamically increase memory usage because the context grows, it won’t work—suddenly, there’s no memory available.

A simple explanation—if something is incorrect, Rick can probably explain it better. ;)


@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):

@ALLMI78 maybe they meant an H100 as "single GPU"...

Anyway, it still takes a lot, and the Gemma models afaik are notorious for absurd tokenizers because they work great on TPUs (not as much on GPUs); at least that's what I was told.

On the GGUF, I'm still not quite at the point where I'm comfortable "making a GGUF version"; I'm still quite new to the world of LLMs. (As an OT, does it make such a large difference?)

About the RAG... sort of, it still takes a while for the first generation. Connecting to your last comment, I'm not sure the allocation alone is the problem. Allocating 40 GB on the GPU takes just a few seconds, and loading a 40 GB model onto the GPU takes like 8-10 seconds, but loading a 20 GB model with a large context (overall 40 GB of memory required) takes waaay longer...


@ALLMI78 commented on GitHub (Mar 19, 2025):

Off-topic: The H100 costs around 38,000 euros here—I have no idea what Google expects. ;)

I’m already amazed how some people (like you) have three GPUs, when even one costs 5,000 euros.

Do you all buy them used, steal them, or am I just too dumb or too poor? ;)


@AlbertoSinigaglia commented on GitHub (Mar 20, 2025):

@ALLMI78 sooo I've re-downloaded all models in their GGUF version; I'm not seeing a major difference in load speed, but a small speed bump on the generation side (though the Gemma3 model I was testing was already a GGUF version)... Regarding the GPUs, the answer is "I'm in academia".

@rick-github Any suggestion on how to reduce the slowdown due to the context size? Is it impossible to have a sort of "gradual" allocation of the context size? For example, allocating 8k, and once the model has used the first 4k, allocating another 8k tokens, and so on?


@rick-github commented on GitHub (Mar 20, 2025):

> However, it's still not quite clear to me why ollama should preallocate the whole 128k context instead of dynamically creating it

It's easier. The positional encodings used in the context buffer are based on continuous sine/cosine functions and the attention mask is computed on the fly, so there's no technical reason for constraining the window (other than the training size of the model), but it makes it easier to manage. For most applications, allocating the max size on the GPU is simpler than trying to manage memory and compete with other clients. The downside is that if you need more context than is available on the GPU, you get a performance hit. There are mechanisms to cope with this, like flash attention and sliding window optimization (#9892). gemma3 is a new model with some unique architectural features, so there are some teething problems as it's integrated with the equally new architecture of the ollama runners, but the next few releases should see improvements.

> With a Gemma 3 12B Q4_K_M and 32K context, I am seeing 24 GB memory usage. The model itself is around 8 GB, so that would mean another 16 GB for the 32K context, including the image projector?

context 12G, model graph 1.1G, projector 0.8G, projector graph 1G.
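Summed with the ~8 GB of Q4_K_M weights mentioned above, that breakdown roughly accounts for the observed usage (illustrative arithmetic only):

```python
# Illustrative arithmetic: components listed above plus ~8 GB of model weights, in GB.
weights, context, model_graph, projector, projector_graph = 8.0, 12.0, 1.1, 0.8, 1.0
print(weights + context + model_graph + projector + projector_graph)  # ~22.9, close to the observed 24 GB
```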

> Because they told us "The current, most capable model that runs on a single GPU."

As Alberto says, what an enterprise considers a GPU is different to consumer hardware offerings.

> I trust you—if you say everything fits and is normal, I accept that.

That's not to say there aren't issues. 0.6.1 uses too much system RAM, which has been addressed in 0.6.2. There are still issues with large VRAM allocations happening during inference, crashing the runner (#9791). q4_0 and q8_0 cache quantization has a significant performance hit (#9683) for gemma3.

> Any suggestion on how to reduce the slowdown due to the context size

Flash attention and the sliding window PR should improve things.

> Is it impossible to have a sort of "gradual" allocation of the context size?

The current architecture doesn't support that in ollama, but this is simple enough to manage in the client. When the client sends a request, it can set the context via `num_ctx`. The response will indicate how many tokens the prompt and the response took, and the client can adjust `num_ctx` if required. This will cause a model reload when the size crosses the maximum available on the GPU, but the model is already in page cache so the reload time will be short(ish).
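
A minimal sketch of that client-side approach, assuming the REST API on the default local port and using the `prompt_eval_count`/`eval_count` fields returned by `/api/generate` (illustrative only):

```python
# Illustrative client-side sketch of the approach described above: start with a
# small num_ctx and grow it only when a response reports that the conversation
# is approaching the window. Assumes Ollama's REST API on the default port;
# prompt_eval_count/eval_count are the token counts returned by /api/generate.
import requests

OLLAMA = "http://localhost:11434/api/generate"
num_ctx = 8192  # small enough that the model stays fully in VRAM

def generate(prompt: str, model: str = "gemma3:27b") -> str:
    global num_ctx
    r = requests.post(OLLAMA, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }).json()
    used = r.get("prompt_eval_count", 0) + r.get("eval_count", 0)
    if used > 0.75 * num_ctx:                # getting close to the current window
        num_ctx = min(num_ctx * 2, 131_072)  # next call reloads the model at the larger size
    return r["response"]
```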


@ALLMI78 commented on GitHub (Mar 20, 2025):

Wow, thank you so much, Rick, for the detailed insights and the links! The point about "Sliding Window Optimization" is particularly interesting. I've often wondered whether some of the techniques used by Unsloth (they write something about 6x more context size and so on...) could also be applied in Ollama. I'm not deeply familiar with the topic, but maybe there's still some potential there? Perhaps they are already collaborating with you, or at least exchanging ideas, if that's not happening already?

"Context: 12 GB, Model Graph: 1.1 GB, Projector: 0.8 GB, Projector Graph: 1 GB."
Very interesting, thanks! I was just a bit confused because I couldn't get a "small" 12B model to run. I mean, if you try loading a "no-name" 12B model and it doesn’t work, that’s kind of expected. But since Gemma2 worked great for me—until the limited context size became an issue—I was really excited about Gemma3. And then came the disappointment. You spend hours trying different things, and nothing works. Of course, that’s part of the game, but I honestly didn’t expect it in this case.

The idea of managing this from the client side is interesting—thanks for the inspirations and for taking the time to explain everything! It was really insightful. :)


@ALLMI78 commented on GitHub (Mar 20, 2025):

Offtopic – Personal

Rick, thank you—and thanks to everyone else here—who contributes out of passion and a genuine desire to help. The topic of AI is far too important to be left solely in the hands of a few ultra-rich individuals or hidden behind paywalls. Every small explanation, every constructive discussion, every shared image, and every bit of information about how things work helps to open the door to this new world for non-academics like me. Keep up the great work! :)


@rick-github commented on GitHub (Mar 20, 2025):

> or at least exchanging ideas

LLMs are an evolving field, and still relatively young. There's lots of research going into improving training and inference, as those ideas and implementations mature they will be integrated into various code bases.

> I was really excited about Gemma3

Yeah, as a small, capable model that does vision and (unofficially) tools, this is looking like a great foot soldier for the upcoming Agentic Wars. I think some of the problems actually come down to timing - a new model with novel architectural features and a revamp of the ollama runner architecture at the same time resulted in a less than stellar launch.


@AlbertoSinigaglia commented on GitHub (Mar 21, 2025):

I deeply share the gratitude of @ALLMI78; @rick-github, thanks again for the explanation. I actually came here from [open-webui#11828](https://github.com/open-webui/open-webui/issues/11828#issuecomment-2735998669), and only later realized that the problem was the context length, so I'll probably open a new issue on OpenWebUI to see if they can implement that "simple" dynamic context length.

I'll keep an eye on #9892, looks very promising


@AlbertoSinigaglia commented on GitHub (Mar 23, 2025):

Hi @rick-github, I've seen the pre-release with the sliding window... may I ask why that's available only for gemma3? For example, Llama3.3 Instruct also has a 130k context length. From a "web client perspective", I think it would be nice to have a request parameter to request fixed or dynamic allocation of the context, like what `num_ctx` does now.

Side question: how do you recognize if a model is a Gemma3 model? E.g. would that work also with unsloth gemma3 model?


@rick-github commented on GitHub (Mar 23, 2025):

> may I ask why that's available only for gemma3?

It's a feature of the modified architecture that gemma3 has. To quote from the [technical report](https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf):

> A challenge with long context is the memory explosion of the KV cache during inference. To reduce this issue, we interleave multiple local layers between each global layer, and assign a smaller span of only 1024 tokens to the local layers. Therefore, only the global layers attend to long context, and we have 1 global for every 5 local layers.
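
To illustrate why that layout matters for memory, here is a small sketch; per-layer sizes reuse the approximate gemma3 hyperparameters assumed earlier (16 KV heads, head_dim 128, fp16 cache), so the numbers are illustrative only:

```python
# Illustrative sketch of the memory effect of the interleaved layout quoted above:
# with 1 global layer per 5 local layers and a 1024-token window on the local
# layers, most of the KV cache stops growing with context length.
def layer_kv_bytes(tokens, n_kv_heads=16, head_dim=128, bytes_per_elem=2):
    return 2 * n_kv_heads * head_dim * bytes_per_elem * tokens

ctx, window, n_layers = 128_000, 1024, 62
n_global = n_layers // 6                   # 1 global for every 5 local layers
n_local = n_layers - n_global

full = n_layers * layer_kv_bytes(ctx)
hybrid = n_global * layer_kv_bytes(ctx) + n_local * layer_kv_bytes(window)
print(f"full attention : {full / 2**30:.1f} GiB")    # every layer caches all 128k tokens
print(f"interleaved SWA: {hybrid / 2**30:.1f} GiB")  # only the global layers do
```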

> For example, Llama3.3 Instruct also has a 130k context length.

Architectural changes can't be backported to an existing model, but now that the Deepmind team have demonstrated the advantages of this approach, new models may adopt it. Probably not llama3.4, maybe llama4. Or they could come up with their own architectural tweaks.

> Side question: how do you recognize if a model is a Gemma3 model? E.g. would that work also with unsloth gemma3 model?

It should work for all derivations of gemma3. I'm not familiar with what unsloth does to their model releases, so I can't say "will work".


@AlbertoSinigaglia commented on GitHub (Mar 24, 2025):

@rick-github I thought that the PR was aiming at building something like vLLM's PagedAttention, which (at first glance) looks like the "more sound" approach to this problem, instead of relying on Google (or whoever in other cases), who might not care that much given their hardware availability

Do you feel that Paged Attention might still be coming to Ollama?


@jessegross commented on GitHub (Mar 24, 2025):

Sliding window attention and paged attention are mostly orthogonal - they can be done separately or together.

Paged attention mostly helps with memory management in multi-request scenarios, whereas sliding window attention reduces the effective context size (memory/GPU cost) for each request. The latter is the one that is more helpful for the discussion here.

As Rick said, sliding window attention is an architectural feature of Gemma and not something that can just be turned on for other models. In fact, sliding window attention was implemented in Gemma from the first release, the upcoming release just has a more optimized implementation of it.


@AlbertoSinigaglia commented on GitHub (Mar 24, 2025):

@jessegross citing the original paper:

> Unlike the traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into KV blocks.

I'm not expert enough to be confident about it, but I'd throw in an educated guess, saying that I don't see how this can be extended to the same-sequence/single-request scenario. At least from the paper, it seems just an efficient way to "see" a non-contiguous tensor as contiguous, thus allowing you to first allocate (for example) an 8k context vector, and then increase it if the LLM is getting close to that limit, without having to reload the whole LLM or the tensor (instead, you just need to allocate a second 8k-long tensor to use as a context extension).

Though, if this is not the case, feel free to correct me.
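
For what it's worth, the block-table indirection described in the quoted paper can be sketched in a few lines; this is purely conceptual, not how ollama or vLLM actually implement it:

```python
# Purely conceptual sketch of the block-table idea in the quoted paper (not how
# ollama or vLLM actually implement it): the KV cache is split into fixed-size
# blocks, a per-sequence table maps logical token positions to physical blocks,
# and growing the context just means appending another block -- no reload and
# no contiguous reallocation.
BLOCK_TOKENS = 1024

class PagedKV:
    def __init__(self):
        self.block_table = []   # logical block index -> physical block id
        self.next_free = 0      # stand-in for a real physical-block allocator

    def ensure_capacity(self, n_tokens: int) -> None:
        while len(self.block_table) * BLOCK_TOKENS < n_tokens:
            self.block_table.append(self.next_free)  # grab one more physical block
            self.next_free += 1

    def physical_slot(self, token_pos: int):
        block = self.block_table[token_pos // BLOCK_TOKENS]
        return block, token_pos % BLOCK_TOKENS

kv = PagedKV()
kv.ensure_capacity(3000)                  # context grew past two blocks -> three allocated
print(kv.block_table, kv.physical_slot(2500))
```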


@AlbertoSinigaglia commented on GitHub (Apr 3, 2025):

Came back here to this closed issue just to say that the new Gemma3 runtime is amazing, and all of you maintaining this project are amazing too, thank you so much!!!

(@rick-github)

Reference: github-starred/ollama#32236