[GH-ISSUE #10529] Larger size for instance of model with lesser num_ctx? #53439

Closed
opened 2026-04-29 03:13:35 -05:00 by GiteaMirror · 5 comments

Originally created by @lasseedfast on GitHub (May 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10529

What is the issue?

I've created two models from qwen3 with different num_ctx values, using a Modelfile like

FROM qwen3:14b
PARAMETER num_ctx SIZE
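
Presumably the two variants were built with something like the following (a reconstruction; the create commands are not in the original report, but the names and num_ctx values match the output below):

    printf 'FROM qwen3:14b\nPARAMETER num_ctx 8192\n'  > Modelfile.8k
    printf 'FROM qwen3:14b\nPARAMETER num_ctx 16384\n' > Modelfile.16k
    ollama create qwen3_14b_8k -f Modelfile.8k
    ollama create qwen3_14b_16k -f Modelfile.16k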

For some reason the 8k model seems to be bigger than the 16k model when loaded.

Info for 16k model:

ollama show qwen3_14b_16k

  Model
    architecture        qwen3     
    parameters          14.8B     
    context length      40960     
    embedding length    5120      
    quantization        Q4_K_M    

  Capabilities
    completion    
    tools         

  Parameters
    temperature       0.6               
    top_k             20                
    top_p             0.95              
    num_ctx           16384             
    repeat_penalty    1                 
    stop              "<|im_start|>"    
    stop              "<|im_end|>"      

  License
    Apache License               
    Version 2.0, January 2004    

Info for 8k model:

ollama show qwen3_14b_8k

  Model
    architecture        qwen3     
    parameters          14.8B     
    context length      40960     
    embedding length    5120      
    quantization        Q4_K_M    

  Capabilities
    completion    
    tools         

  Parameters
    repeat_penalty    1                 
    stop              "<|im_start|>"    
    stop              "<|im_end|>"      
    temperature       0.6               
    top_k             20                
    top_p             0.95              
    num_ctx           8192              

  License
    Apache License               
    Version 2.0, January 2004 

This is what I get when checking the loaded models with ollama ps:

ollama ps
NAME                   ID              SIZE     PROCESSOR    UNTIL
qwen3_14b_8k:latest    e4a0644ae615    19 GB    100% GPU     4 minutes from now

ollama ps
NAME                    ID              SIZE     PROCESSOR    UNTIL
qwen3_14b_16k:latest    c792f685dea0    14 GB    100% GPU     4 minutes from now

I might have misunderstood something, but shouldn't the 16k model be larger in size?
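
As a sanity check on what num_ctx should cost, the KV cache is the part that scales with context. A back-of-envelope estimate in shell arithmetic (assuming Qwen3-14B uses 40 layers, 8 KV heads, and head dimension 128 with the default f16 cache; those architecture numbers are not from this issue):

    # bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 (f16)
    echo $(( 2 * 40 * 8 * 128 * 2 * 8192  / 1048576 )) MiB    # 8k  -> 1280 MiB
    echo $(( 2 * 40 * 8 * 128 * 2 * 16384 / 1048576 )) MiB    # 16k -> 2560 MiB

By that estimate the 16k instance should be roughly 1.25 GiB larger than the 8k one, not 5 GB smaller, so the cache alone does not explain the numbers above (see the comments below).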

Relevant log output


OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.6.6

GiteaMirror added the question and bug labels 2026-04-29 03:13:39 -05:00

@rick-github commented on GitHub (May 2, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.

$ ollama ps
NAME                    ID              SIZE     PROCESSOR    UNTIL   
qwen3:14b               7d7da67570e2    10 GB    100% GPU     Forever    
qwen3_14b_8k:latest     e4a0644ae615    12 GB    100% GPU     Forever    
qwen3_14b_16k:latest    0a79612bfa82    14 GB    100% GPU     Forever    
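
On Linux those server logs can usually be pulled from systemd, assuming the default ollama service unit:

    journalctl -u ollama --no-pager | tail -200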

@Nukepayload2 commented on GitHub (May 2, 2025):

I can reproduce similar behavior on Windows 10 with Ollama 0.6.7. I'm using qwen3:14b-q8_0. The VRAM usage is the same for num_ctx=16384 and 32768. It's probably a bug in Ollama.

Steps:

  1. Enable flash attention via its environment variable (see the environment sketch after this list).
  2. Set the KV cache type to q8_0 via its environment variable.
  3. Start ollama serve.
  4. Run qwen3:14b-q8_0 with the following Modelfile:

     # Modelfile generated by "ollama show"
     # To build a new Modelfile based on this, replace FROM with:
     FROM qwen3:14b-q8_0

     PARAMETER num_ctx 32768
     PARAMETER num_batch 64
     PARAMETER min_p 0.05

  5. Run the command /set parameter num_ctx 32768 and then send something to the AI.
  6. Use ollama ps to check the VRAM usage.
  7. Try different num_ctx values.
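
The environment setup in steps 1-3 presumably corresponds to these documented variables (a sketch; the exact invocation is not in the comment, and Windows cmd syntax is assumed):

    :: assumed reconstruction of steps 1-3
    set OLLAMA_FLASH_ATTENTION=1
    set OLLAMA_KV_CACHE_TYPE=q8_0
    ollama serve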

Actual behavior:
When num_ctx=32768, the size is 20 GB.
When num_ctx=16384, the size is 20 GB.
When num_ctx=8192, the size is 18 GB.

Expected behavior:
The VRAM usage should match that of llama.cpp's llama-server.exe.

Log file:
ollama_log_10529.txt (https://github.com/user-attachments/files/20012130/ollama_log_10529.txt)


@rick-github commented on GitHub (May 2, 2025):

Set OLLAMA_NUM_PARALLEL=1 in the server environment (see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests).
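
Per that FAQ, Ollama sizes the KV cache for num_ctx multiplied by the number of parallel slots, and by default the slot count is picked automatically (4 when memory allows, otherwise 1). If that auto-selection is what happened here, it would also explain the original report: an 8k model scheduled with 4 slots allocates a 32k cache while a 16k model that only fits 1 slot allocates 16k, so the 8k instance ends up larger. A minimal way to rule this out (Linux shell shown; systemd installs set the variable in the service unit instead):

    # force a single parallel slot so num_ctx is the entire cache
    export OLLAMA_NUM_PARALLEL=1
    ollama serve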


@Nukepayload2 commented on GitHub (May 2, 2025):

I have set OLLAMA_NUM_PARALLEL=1 and restarted the Ollama server, but the VRAM usage is still higher than expected.

With qwen3:14b-q8_0 and num_ctx set to 65535, the size is 25 GB, which is larger than the VRAM usage of llama-server.exe (21 GB according to the Windows Task Manager).

The llama.cpp server was started with the following command, where the model path was copied from ollama show --modelfile:

    llama-server --model L:\MlProjCache\OllamaModels\blobs\sha256-6335adf2028978aee1cd610abcb7047e9b882ad2ebb8214ceee799fd3ddf423b --ctx-size 65536 --batch-size 64 --n-gpu-layers 100 --flash-attn -ctk q8_0 -ctv q8_0 --no-mmap

Log file:
ollama_log_10529_2.txt (https://github.com/user-attachments/files/20012404/ollama_log_10529_2.txt)

The log file of llama-server.exe for comparison:
llama-server_log_10529.txt (https://github.com/user-attachments/files/20012453/llama-server_log_10529.txt)
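
Note that ollama ps reports Ollama's own size estimate while Task Manager reports the actual allocation, so the two numbers measure different things. The driver's view of actual usage can be read directly with a standard nvidia-smi query (works on both Windows and Linux):

    nvidia-smi --query-gpu=memory.used --format=csv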


@rick-github commented on GitHub (May 2, 2025):

Ollama's memory estimation can be inaccurate, particularly when flash attention is enabled. See #6160.

Reference: github-starred/ollama#53439