[GH-ISSUE #11018] gemma3:12b-it-qat Reports Different Memory Usage (12 GB vs. 22 GB) on Identical Model Configurations #33025

Closed
opened 2026-04-22 15:10:12 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @luke2023 on GitHub (Jun 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11018

What is the issue?

When running the gemma3:12b-it-qat model on two different PCs using Ollama, the memory usage reported by ollama ps differs significantly despite identical model configurations:

PC 1: Reports 12 GB memory usage.
PC 2: Reports 22 GB memory usage.

Both PCs use 100% GPU, and the model configuration (architecture, parameters, quantization, etc.) appears identical.
What did you expect to happen?
I expected the memory usage to be consistent across both PCs, as the model (gemma3:12b-it-qat) and its parameters (Q4_0 quantization, 12.2B parameters, 131k context length) are the same.

What actually happened?
PC 2 reports a much higher memory usage (22 GB) compared to PC 1 (12 GB), which suggests a potential issue with configuration, hardware, or Ollama’s memory reporting.

Environment
PC 2 has an RTX 5090; it used to report normal memory usage, but now it is much higher.
PC 1 has an RTX 4090.

Ollama Version:
PC 1: 0.9.0
PC 2: 0.9.0
Model: gemma3:12b-it-qat (Q4_0 quantization, 12.2B parameters, 131k context length)
Operating System: Windows (exact version not specified)

CUDA Toolkit Version:
PC 2: 12.8
PC 1: 12.4
System RAM:
PC 2: 128 GB
PC 1: 64 GB
Other Running Models:
PC 1: bge-m3:latest (1.7 GB)
PC 2: None

Relevant log output

PC 1 & PC2 (ollama show gemma3:12b-it-qat):
  Model
    architecture        gemma3
    parameters          12.2B
    context length      131072
    embedding length    3840
    quantization        Q4_0

  Capabilities
    completion
    vision

  Parameters
    temperature    1
    top_k          64
    top_p          0.95
    stop           "<end_of_turn>"

PC 1 (ollama ps):
NAME                 ID              SIZE      PROCESSOR    UNTIL
gemma3:12b-it-qat    5d4fa005e7bb    12 GB     100% GPU     4 minutes from now
bge-m3:latest        790764642607    1.7 GB    100% GPU     2 minutes from now

PC 2 (ollama ps):
NAME                 ID              SIZE     PROCESSOR    UNTIL
gemma3:12b-it-qat    5d4fa005e7bb    22 GB    100% GPU     59 minutes from now

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.9.0

GiteaMirror added the bug label 2026-04-22 15:10:12 -05:00
Author
Owner

@rick-github commented on GitHub (Jun 8, 2025):

Differences in allocated size are usually caused by differences in the context buffer or parallelism. Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will show the differences.
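The context buffer is usually the dominant term: KV-cache memory grows linearly with both the configured context length and the number of parallel slots. A rough back-of-envelope sketch follows; the layer/head/dimension numbers are illustrative placeholders, not gemma3's actual shapes (and gemma3 uses sliding-window attention on most layers, which reduces the real cost), so only the scaling behavior should be taken from this.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, n_parallel,
                   bytes_per_elem=2):
    """Rough full-attention KV-cache size: a K and a V tensor (factor 2)
    per layer, per parallel slot, f16 elements (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * n_parallel * bytes_per_elem

# Illustrative shapes only (hypothetical, not read from the model file):
small = kv_cache_bytes(48, 8, 256, 4096, 2)    # 4k window, 2 parallel slots
large = kv_cache_bytes(48, 8, 256, 131072, 2)  # 131k window, 2 parallel slots
print(f"4k window:   {small / 2**30:.1f} GiB")   # → 3.0 GiB
print(f"131k window: {large / 2**30:.1f} GiB")   # → 96.0 GiB (32x larger)
```

The exact figures depend on the model's real attention shapes and any KV-cache quantization, but the linear blow-up with ctx_len × n_parallel is why a 131072-token window can dwarf the ~8 GB of Q4_0 weights.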

Author
Owner

@luke2023 commented on GitHub (Jun 10, 2025):

@rick-github
Thanks for the help. This is my server log:
time=2025-06-10T18:36:10.445+08:00 level=INFO source=routes.go:1234 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:131072 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\Luke\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:2 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-06-10T18:36:10.448+08:00 level=INFO source=images.go:479 msg="total blobs: 10"
time=2025-06-10T18:36:10.448+08:00 level=INFO source=images.go:486 msg="total unused blobs removed: 0"
time=2025-06-10T18:36:10.449+08:00 level=INFO source=routes.go:1287 msg="Listening on 127.0.0.1:11434 (version 0.9.0)"
time=2025-06-10T18:36:10.449+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-06-10T18:36:10.449+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-06-10T18:36:10.449+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=12 efficiency=0 threads=24
time=2025-06-10T18:36:10.606+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-529fd035-042a-13c4-94e9-b645f0097b2e library=cuda variant=v12 compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5090" total="31.8 GiB" available="30.1 GiB"

Author
Owner

@rick-github commented on GitHub (Jun 10, 2025):

This is not enough log.

Author
Owner

@luke2023 commented on GitHub (Jun 10, 2025):

@rick-github got it, but I accidentally solved the problem, and the result seems unreasonable to me.

After removing the OLLAMA_CONTEXT_LENGTH environment variable (set to 131072) in PowerShell, the machine went back to normal...

PS C:\Users\Luke> Get-ChildItem Env:OLLAMA_CONTEXT_LENGTH

Name                           Value
----                           -----
OLLAMA_CONTEXT_LENGTH          131072

But when I inspected the model, it still shows the same context length (131072). Could you explain this? Thanks

PS C:\Users\Luke> ollama ps
NAME                 ID              SIZE      PROCESSOR    UNTIL
gemma3:12b-it-qat    5d4fa005e7bb    12 GB     100% GPU     59 minutes from now
gemma3:1b            8648f39daa8f    1.9 GB    100% GPU     59 minutes from now
PS C:\Users\Luke> ollama show gemma3:12b-it-qat
  Model
    architecture        gemma3
    parameters          12.2B
    context length      131072
    embedding length    3840
    quantization        Q4_0

  Capabilities
    completion
    vision

  Parameters
    stop           "<end_of_turn>"
    temperature    1
    top_k          64
    top_p          0.95

PS C:\Users\Luke>
Author
Owner

@rick-github commented on GitHub (Jun 10, 2025):

> context length      131072

This is the context length the model was trained with.

> OLLAMA_CONTEXT_LENGTH:131072

This is the length of the context window that ollama uses for a completion. This value can be changed; `context length` cannot be.
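Besides the environment variable, the runtime window can also be set per request through the `num_ctx` option of Ollama's REST API. A minimal sketch that only builds the `/api/generate` request payload (actually sending it assumes a server listening on the default localhost:11434):

```python
import json

# Build a /api/generate request that overrides the runtime context window.
# "num_ctx" is the per-request counterpart of OLLAMA_CONTEXT_LENGTH; it does
# not (and cannot) change the trained "context length" shown by ollama show.
payload = {
    "model": "gemma3:12b-it-qat",
    "prompt": "Why is the sky blue?",
    "options": {"num_ctx": 8192},
}
body = json.dumps(payload)
print(body)
# Send with e.g.: curl http://localhost:11434/api/generate -d '<body>'
```

Using an explicit `num_ctx` per request avoids the machine-wide surprise from a forgotten environment variable, which is what bit PC 2 here.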

Author
Owner

@luke2023 commented on GitHub (Jun 10, 2025):

@rick-github Thank you for your help! Learned a lot!


Reference: github-starred/ollama#33025