Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 16:11:34 -05:00)
Closed · opened 2026-05-04 14:19:24 -05:00 by GiteaMirror · 27 comments
Originally created by @AlbertoSinigaglia on GitHub (Mar 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9890
What is the issue?
Just installed Gemma3, with context length 131072, and just learned that this means nothing to Ollama, since it still runs with a 2048 context size if not specified otherwise.
So, if I run it with the default context, it runs smoothly, loads the model correctly on a single GPU, outputs what it's supposed to in a matter of seconds, and has no problem at all answering.
However, as soon as I run /set parameter num_ctx 128000, it shards the model across GPUs and never answers again.
Context: I'm running it on a server with 3x A6000, using the following config in systemctl edit ollama.
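(The poster's actual override is not captured in this mirror; the sketch below is only a generic illustration of what a systemctl edit ollama drop-in looks like, with example variables rather than the configuration used here.)

```sh
# Hypothetical example, not the poster's actual config.
# `systemctl edit ollama` opens a drop-in override file, e.g.
# /etc/systemd/system/ollama.service.d/override.conf, with contents like:
#
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
#   Environment="OLLAMA_MAX_LOADED_MODELS=1"
#
# The service must be restarted for the new environment to take effect:
sudo systemctl restart ollama
```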
Relevant log output
OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.6.0
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
To be fair, the model does answer after about 5 minutes, but it runs at 1 t/s, whereas initially it is around 10 t/s... I would understand it if the context were actually 128K, but since both versions work with almost no context in the input, it feels weird that a longer context causes such a slowdown.
@rick-github commented on GitHub (Mar 19, 2025):
This is tripling the size of the context that ollama allocates on the GPU. This causes a bunch of layers to be loaded into system RAM, and inference is much slower. Server logs will show how much VRAM the increased context size is consuming.
@rick-github commented on GitHub (Mar 19, 2025):
https://github.com/ollama/ollama/issues/9791#issuecomment-2727576383
@ALLMI78 commented on GitHub (Mar 19, 2025):
"the model after like 5 minutes answers, yet it works at 1t/s"
That's exactly what I was seeing in my setup with a 32k context... I feel like there are still issues with larger context sizes, but since I can't break down my entire workflow into a reproducible example, I'm having trouble presenting the problem. Maybe you can... I also wasn't sure if my GPU (4060 Ti 16GB) is simply too weak for Gemma 3.
@rick-github commented on GitHub (Mar 19, 2025):
More context == less room for model weights in VRAM. Less room for model weights in VRAM == model weights in system RAM. Model weights in system RAM == slower inference.
@ALLMI78 commented on GitHub (Mar 19, 2025):
Dear Rick,
You're clearly the pro here, and I don’t mean to challenge you, but if someone with 3x A6000 GPUs can’t get Gemma 3 running properly, is that really expected behavior?
Your explanation makes sense and is technically correct, but again—I can run 14B models on my GPU with a 32k context without any issues, yet I can't get Gemma 3 (12B) to work, experiencing exactly the symptoms described here.
I remember reading somewhere that Gemma 3 requires additional memory for image-related tasks, but I can’t find that source anymore. That could be an explanation…?
@AlbertoSinigaglia try 0.6.2
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
@rick-github, thanks for the clarification; I switched to Environment="OLLAMA_NUM_PARALLEL=1" to see if it makes a difference. I'm uploading a log file with everything saved (for reference, the server has 512 GB of RAM and 64 GB of swap).
Unfortunately, the same happens:
To be fair, the 5-minute wait time is only for the first generation after setting the context size to 128k, though the 1t/s speed still remains.
@ALLMI78 I'm actually running the 27B, yet the behavior is the same using Phi4 with 128k context (which is a 14B model)... I'll try with the newest ollama version
logs.txt
@ALLMI78 commented on GitHub (Mar 19, 2025):
Phi4 with 128k? Are you sure? Does it support that? I've only seen 16k versions until now...
To compare, you can also try qwen-14b; they run fine for me...
https://ollama.com/library/qwen2.5 (supports up to 128K tokens and has multilingual support)
@rick-github commented on GitHub (Mar 19, 2025):
A token has a different size for different models, so a 32k context for phi4:14b is 6.2G and for gemma3:12b is 12G. If a model needs to be sharded across several devices, the amount of VRAM required goes up because a copy of the graph needs to run on each device. gemma3 has additional memory requirements for the image projector, so gemma3 is going to consume more VRAM, pushing model layers into system RAM where inference is slower.
In this log, OLLAMA_NUM_PARALLEL is 3. This increases the context buffer that ollama allocates to 384000 tokens, which needs 182G. As a result, ollama can only load 12 of the 63 layers of the model in VRAM; the rest are loaded into system RAM, where the slower CPU does inference. If OLLAMA_NUM_PARALLEL=1, the context cache will fall to approximately 60G, leaving 120G for loading extra model layers in VRAM.
Changing the context size causes a model reload; that's why the first inference after changing num_ctx takes a while.
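(Side note, not from the original comment: an easy way to see how much of a loaded model ended up in VRAM versus system RAM is ollama ps.)

```sh
# Illustrative only: `ollama ps` reports where a loaded model's weights ended up.
# The PROCESSOR column shows the CPU/GPU split, e.g. "100% GPU" when everything fits
# in VRAM, or something like "41%/59% CPU/GPU" when layers spilled into system RAM.
ollama ps
```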
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
@ALLMI78 Sure, you are right, though I was not looking for a nice generation, only compute time requirements, and I had only that one as a "small" model.
@rick-github, you're right; I had mistakenly changed num models from 3 to 1 instead of num parallel (oops...).
Now the loading is pretty fast, though it allocates 20 GB on each GPU... which aligns with your predicted 60 GB of memory usage. However, it's still not quite clear to me why ollama should preallocate the whole 128k context instead of creating it dynamically based on the actual context (so growing from 0 to N over time)... Does it have anything to do with KV caching?
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
[OT] @ALLMI78 https://huggingface.co/microsoft/Phi-4-multimodal-instruct WELL, new version just dropped eheh
@ALLMI78 commented on GitHub (Mar 19, 2025):
@rick-github Dear Rick,
Thank you for the detailed explanations. Yes, you had already explained the OLLAMA_NUM_PARALLEL = 1/3 point, and I understood that.
I had already noticed that different models use different tokenizers. For example, I saw that Llama models require significantly fewer tokens than Qwen models (about 30-40% fewer). I understood all your explanations and think everything is absolutely correct—except for one point…
With a Gemma 3 12B Q4_K_M and 32K context, I am seeing 24 GB memory usage. The model itself is around 8 GB, so that would mean another 16 GB for the 32K context, including the image projector?
For Qwen 14B Q4_K_M with 8GB size, I get 16-17 GB memory usage with 32K context, so 8 for the model and 8 for the 32k context.
Because they told us "The current, most capable model that runs on a single GPU", and since there were initial RAM issues with Gemma 3, I remained a bit skeptical. I'm not able to run the 12b (or only at 1 t/s), and the 4b also doesn't work for me; it answers crap.
I trust you—if you say everything fits and is normal, I accept that. I just noticed that the symptoms here were exactly the same as for me before: too much memory usage → swapping to CPU → slow speed (~1 token/s), exactly like in my case.
@AlbertoSinigaglia
https://huggingface.co/models?sort=trending&search=phi-4+multimodal+gguf
@ALLMI78 commented on GitHub (Mar 19, 2025):
### OT -> preallocate
I think it won't work without preallocation, or at least it is hard to get a stable solution that way. You can't change the context size at runtime.
Imagine you start with 2K context and you later want to increase it, but in the meantime, the user has loaded another software that blocks VRAM, or some process is using it.
If you now need to dynamically increase memory usage because the context grows, it won’t work—suddenly, there’s no memory available.
A simple explanation—if something is incorrect, Rick can probably explain it better. ;)
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
@ALLMI78 maybe they meant an H100 as "single GPU"...
Anyway, it still takes a lot, and also the Gemma models afaik are notorious for absurd tokenizers, because they work great on TPUs (not as much on GPUs), at least that's what I was told
On the GGUF, I'm still not quite at the point where I'm comfortable "making a GGUF version"; I'm still quite new to the world of LLMs. (As an OT, does it make such a large difference?)
About the RAG... sort of, it still takes a while for the first generation. Connecting to your last comment, I'm not sure the allocation is the only problem. Allocating 40 GB on the GPU takes just a few seconds, and loading a 40 GB model onto the GPU takes like 8-10 seconds, but loading a 20 GB model with a large context (overall 40 GB of memory required) takes waaay longer...
@ALLMI78 commented on GitHub (Mar 19, 2025):
Off-topic: The H100 costs around 38,000 euros here—I have no idea what Google expects. ;)
I’m already amazed how some people (like you) have three GPUs, when even one costs 5,000 euros.
Do you all buy them used, steal them, or am I just too dumb or too poor? ;)
@AlbertoSinigaglia commented on GitHub (Mar 20, 2025):
@ALLMI78 sooo I've re-downloaded all the models in their GGUF version. I'm not seeing a major difference in load speed, but there's a small speed bump on the generation side (though the Gemma3 model I was testing was already a GGUF version)... Regarding the GPUs, the answer is "I'm in academia".
@rick-github Any suggestion on how to reduce the slowdown due to the context size? Is it impossible to have a sort of "gradual" allocation of the context? For example, allocating 8k, and once the model has used the first 4k, allocating another 8k tokens, and so on?
@rick-github commented on GitHub (Mar 20, 2025):
It's easier. The positional encodings used in the context buffer are based on continuous sine/cosine functions, and the attention mask is computed on the fly, so there's no technical reason for constraining the window (other than the training size of the model), but it makes it easier to manage. For most applications, allocating the max size on the GPU is simpler than trying to manage memory and compete with other clients. The downside is that if you need more context than is available on the GPU, you get a performance hit. There are mechanisms to cope with this, like flash attention and sliding window optimization (#9892). gemma3 is a new model and has some unique architectural features, so there are some teething problems as it's integrated with the also-new architecture of the ollama runners, but the next few releases should see improvements.
context 12G, model graph 1.1G, projector 0.8G, projector graph 1G.
As Alberto says, what an enterprise considers a GPU is different to consumer hardware offerings.
That's not to say there aren't issues. 0.6.1 uses too much system RAM, which has been addressed in 0.6.2. There are still issues with large VRAM allocations happening during inference, crashing the runner (#9791). q4_0 and q8_0 cache quantization has a significant performance hit (#9683) for gemma3.
Flash attention and the sliding window PR should improve things.
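(For reference, and not part of the original comment: both of the knobs mentioned above are exposed as server environment variables. The values below are examples only, and as noted above, q4_0/q8_0 cache quantization currently carries a performance hit for gemma3.)

```sh
# Example settings only; put these in the same systemd drop-in (or shell environment)
# used to configure the ollama server, then restart it.
export OLLAMA_FLASH_ATTENTION=1     # enable flash attention
export OLLAMA_KV_CACHE_TYPE=q8_0    # quantize the KV cache (default is f16)
```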
The current architecture doesn't support that in ollama, but this is simple enough to manage in the client. When the client sends a request, it can set the context via num_ctx. The response will indicate how many tokens the prompt and the response took, and the client can adjust num_ctx if required. This will cause a model reload when the size crosses the maximum available on the GPU, but the model is already in the page cache, so the reload time will be short(ish).
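(A minimal sketch of this client-side approach, assuming the default local API endpoint; the model name and context size are placeholders, and the adjustment policy is up to the client.)

```sh
# Minimal sketch, not a complete client: send a request with an explicit num_ctx,
# then read the token counts from the response to decide whether the next request
# needs a larger context. Model name and sizes are placeholder values.
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Summarize the following document ...",
  "stream": false,
  "options": { "num_ctx": 8192 }
}' | jq '{prompt_eval_count, eval_count}'
# If prompt_eval_count + eval_count approaches num_ctx, resend with a larger num_ctx;
# crossing what fits in VRAM triggers the model reload described above.
```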
@ALLMI78 commented on GitHub (Mar 20, 2025):
Wow, thank you so much, Rick, for the detailed insights and the links! The point about "Sliding Window Optimization" is particularly interesting. I've often wondered whether some of the techniques used by Unsloth (they write something about 6x more context size and so on...) could also be applied in Ollama. I'm not deeply familiar with the topic, but maybe there's still some potential there? Perhaps they are already collaborating with you, or at least exchanging ideas—if that's not happening already?
"Context: 12 GB, Model Graph: 1.1 GB, Projector: 0.8 GB, Projector Graph: 1 GB."
Very interesting, thanks! I was just a bit confused because I couldn't get a "small" 12B model to run. I mean, if you try loading a "no-name" 12B model and it doesn’t work, that’s kind of expected. But since Gemma2 worked great for me—until the limited context size became an issue—I was really excited about Gemma3. And then came the disappointment. You spend hours trying different things, and nothing works. Of course, that’s part of the game, but I honestly didn’t expect it in this case.
The idea of managing this from the client side is interesting—thanks for the inspirations and for taking the time to explain everything! It was really insightful. :)
@ALLMI78 commented on GitHub (Mar 20, 2025):
Offtopic – Personal
Rick, thank you—and thanks to everyone else here—who contributes out of passion and a genuine desire to help. The topic of AI is far too important to be left solely in the hands of a few ultra-rich individuals or hidden behind paywalls. Every small explanation, every constructive discussion, every shared image, and every bit of information about how things work helps to open the door to this new world for non-academics like me. Keep up the great work! :)
@rick-github commented on GitHub (Mar 20, 2025):
LLMs are an evolving field, and still relatively young. There's lots of research going into improving training and inference, as those ideas and implementations mature they will be integrated into various code bases.
Yeah, as a small, capable model that does vision and (unofficially) tools, this is looking like a great foot soldier for the upcoming Agentic Wars. I think some of the problems actually come down to timing - a new model with novel architectural features and a revamp of the ollama runner architecture at the same time resulted in a less than stellar launch.
@AlbertoSinigaglia commented on GitHub (Mar 21, 2025):
I deeply share the gratitude of @ALLMI78, @rick-github thanks again for the explanation. I actually came here from this issue #11828, and only later realized that the problem was the context length, so I'll probably open a new issue on OpenWebUI to see if they can manage to implement that "simple" dynamic context length
I'll keep an eye on #9892, looks very promising
@AlbertoSinigaglia commented on GitHub (Mar 23, 2025):
Hi @rick-github, I've seen the pre-release with the sliding window... may I ask why that's available only for gemma3? For example, Llama3.3 Instruct also has a 130k context length. From a "web client perspective", I think it would be nice to have a request parameter for requesting fixed or dynamic allocation of the context, like what num_ctx does now for the context.
Side question: how do you recognize whether a model is a Gemma3 model? E.g., would that also work with an unsloth gemma3 model?
@rick-github commented on GitHub (Mar 23, 2025):
It's a feature of the modified architecture that gemma3 has. To quote from the technical report:
Architectural changes can't be backported to an existing model, but now that the Deepmind team have demonstrated the advantages of this approach, new models may adopt it. Probably not llama3.4, maybe llama4. Or they could come up with their own architectural tweaks.
It should work for all derivations of gemma3. I'm not familiar with what unsloth does to their model releases, so I can't say "will work".
@AlbertoSinigaglia commented on GitHub (Mar 24, 2025):
@rick-github I thought that the PR was aiming at building something like vLLM's Paged Attention, which (at first glance) looks like the "more sound" approach to this problem, instead of relying on Google (or whoever, in other cases), which might not care that much given their hardware availability.
Do you feel that Paged Attention might still be coming to Ollama?
@jessegross commented on GitHub (Mar 24, 2025):
Sliding window attention and paged attention are mostly orthogonal - they can be done separately or together.
Paged attention mostly helps with memory management in multi-request scenarios, whereas sliding window attention reduces the effective context size (memory/GPU cost) for each request. The latter is the one that is more helpful for the discussion here.
As Rick said, sliding window attention is an architectural feature of Gemma and not something that can just be turned on for other models. In fact, sliding window attention was implemented in Gemma from the first release, the upcoming release just has a more optimized implementation of it.
@AlbertoSinigaglia commented on GitHub (Mar 24, 2025):
@jessegross citing the original paper:
I'm not expert enough to be confident about it, but I'd throw out an educated guess and say that I don't see how this can be extended to the same-sequence/single-request scenario. At least from the paper, it seems to be just an efficient way to "see" a non-contiguous tensor as contiguous, thus allowing you to first allocate (for example) an 8k context vector and then increase it if the LLM is getting close to that limit, without having to reload the whole LLM or the tensor (instead, you just need to allocate a second 8k-long tensor to use as a context extension).
Though, if this is not the case, feel free to correct me.
@AlbertoSinigaglia commented on GitHub (Apr 3, 2025):
I came back to this closed issue just to say that the new Gemma3 runtime is amazing, and all of you maintaining this project are so amazing. Thank you so much!!!
(@rick-github)