[GH-ISSUE #9857] Ollama v0.6.2 - Gemma3 Model Stops Responding After a Few Prompts #68511

GiteaMirror · 2026-05-04T14:14:46-05:00

GiteaMirror commented

2026-05-04 14:14:46 -05:00

Originally created by @ronaldvdmeer on GitHub (Mar 18, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9857

What is the issue?

When using Ollama v0.6.2 with the model gemma3:27b-it-q4_K_M, the model stops responding after a few interactions. There is no error message, but it no longer generates any output after receiving a prompt.

Environment
• Ollama version: v0.6.2
• Model: gemma3:27b-it-q4_K_M
• OS: Debian 12
• Hardware:
  • GPU: NVIDIA RTX 3090
  • RAM: 48GB

Steps to Reproduce
Start Ollama and load the model:

ollama run gemma3:27b-it-q4_K_M

Ask a few questions, for example:

>>> hello there, how are you doing today
>>> good to hear my friend. can you tell me something interesting about March 18th
>>> yes sure
>>> what was the name of the spacecraft

After a few responses, the model suddenly stops responding, with no error message displayed.

Expected Behavior
The model should continue generating responses without interruption.

Actual Outcome
After a few successful interactions, the model becomes unresponsive without any error message.

Additional Information
• There is no consistent pattern as to when the issue occurs.
• No errors are logged in the console.
• CPU and GPU usage remain within normal levels.
• Restarting Ollama temporarily restores functionality.

Any guidance on resolving this issue or debugging further would be greatly appreciated.
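
One way to narrow this down is to replay the same prompts against the HTTP API and stop at the first empty reply. The sketch below is an illustration under stated assumptions, not part of the original report: it assumes the server is reachable on Ollama's default port 11434 and that a non-streaming `/api/chat` response carries its text in `message.content`. The helper names `is_empty_reply`, `chat`, and `reproduce` are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Dict, List, Optional

# Assumptions: default Ollama port and the model tag from this report.
OLLAMA_URL = "http://127.0.0.1:11434/api/chat"
MODEL = "gemma3:27b-it-q4_K_M"


def is_empty_reply(response: Dict) -> bool:
    """Return True when a non-streaming /api/chat response carries no text."""
    content = response.get("message", {}).get("content", "")
    return content.strip() == ""


def chat(messages: List[Dict], url: str = OLLAMA_URL, model: str = MODEL) -> Dict:
    """Send one non-streaming chat request and return the decoded JSON body."""
    payload = json.dumps({"model": model, "messages": messages, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.load(reply)


def reproduce(prompts: List[str], send: Callable[[List[Dict]], Dict] = chat) -> Optional[int]:
    """Replay prompts as one conversation; return the index of the first empty reply."""
    history: List[Dict] = []
    for index, prompt in enumerate(prompts):
        history.append({"role": "user", "content": prompt})
        response = send(history)
        if is_empty_reply(response):
            # done_reason is printed only if the server included it.
            print(f"turn {index}: empty reply, done_reason={response.get('done_reason')!r}")
            return index
        # Keep the assistant turn so the context grows as it does in `ollama run`.
        history.append(response["message"])
    return None
```

Calling `reproduce([...])` with the four prompts listed above returns the turn index at which the reply came back empty, or `None` if every turn produced text. The `send` parameter exists only so the loop can be exercised without a running server.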

Relevant log output

root@su8ai01:~# ollama show gemma3:27b-it-q4_K_M
  Model
    architecture        gemma3    
    parameters          27.4B     
    context length      8192      
    embedding length    5376      
    quantization        Q4_K_M    

  Parameters
    stop           "<end_of_turn>"    
    temperature    0.1                

  License
    Gemma Terms of Use                  
    Last modified: February 21, 2024    

root@su8ai01:~# ollama run gemma3:27b-it-q4_K_M
>>> hello there, how are you doing today
Hello! As an AI, I don't *experience* feelings like "doing well," but I'm functioning perfectly and ready 
to help! So you could say I'm doing great! 😄 

How about *you*? How are *you* doing today? I hope you're having a good one so far. 

Is there anything I can help you with?





>>> good to hear my friend. can you tell me something interesting about march 18th
You're kind! 😊 Okay, here's something interesting about March 18th:

**On March 18th, 1937, the first blood bank opened in Chicago!** 

It was established by Dr. Bernard Fantus, who pioneered the concept of storing blood for future 
transfusions. Before this, transfusions were often done directly from donor to recipient, which was risky 
and time-consuming. Dr. Fantus realized the need for a readily available supply of blood, and his work 
revolutionized medical care.

Pretty cool, right? It's a day that significantly impacted the field of medicine and saved countless lives!

Would you like to know another interesting fact about March 18th, or perhaps a fact about a different date?





>>> yes sure
Alright! Here's another interesting fact about March 18th:

**On March 18th, 1965, Alexei Leonov, a Soviet cosmonaut, performed the first spacewalk!**

He exited the Voskhod 2 spacecraft and spent 12 minutes and 9 seconds outside the vehicle, tethered by a 
5.35-meter (17.6 ft) umbilical cord. It was a huge moment in the Space Race and a significant achievement 
in human space exploration. He faced some challenges - his suit inflated, making it difficult to re-enter 
the airlock - but he managed it successfully!

Pretty daring, huh? 🚀





>>> what was the name of the spacecraft
The spacecraft Alexei Leonov spacewalked from was called **Voskhod 2** (Восход-2 in Russian).

It was a modified version of the Vostok spacecraft, and it was specifically adapted to allow for a 
spacewalk. It carried a crew of two: Pavel Belyayev (commander) and Alexei Leonov (pilot/spacewalker).

It's interesting

>>> very interesting


>>> can you tell me something more about today


>>> hello?


>>> /bye
root@su8ai01:~# ollama ps
NAME                    ID              SIZE     PROCESSOR    UNTIL   
gemma3:27b-it-q4_K_M    30ddded7fba6    24 GB    100% GPU     Forever    

root@su8ai01:~# topshort
top - 16:07:45 up 22:38,  8 users,  load average: 0.01, 0.13, 0.71
Tasks: 188 total,   1 running, 187 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 50.0 sy,  0.0 ni, 50.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
MiB Mem :  48176.7 total,  21945.4 free,   5502.9 used,  21675.4 buff/cache     
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  42673.8 avail Mem 

root@su8ai01:~# nvidia-smi
Tue Mar 18 16:06:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:06:10.0 Off |                  N/A |
|  0%   36C    P8             17W /  350W |   20968MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    255476      C   /usr/local/bin/ollama                       20962MiB |
+-----------------------------------------------------------------------------------------+

OS

Linux

GPU

Nvidia

CPU

Other

Ollama version

ollama version is 0.6.2

GiteaMirror added the bug label 2026-05-04 14:14:46 -05:00

GiteaMirror closed this issue

2026-05-04 14:14:49 -05:00

GiteaMirror commented

2026-05-04 14:14:52 -05:00

@rick-github commented on GitHub (Mar 18, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

GiteaMirror commented

2026-05-04 14:14:55 -05:00

@ronaldvdmeer commented on GitHub (Mar 18, 2025):

> Server logs will aid in debugging.

Mar 18 16:55:27 su8ai01 systemd[1]: Stopped ollama.service - Ollama Service.
Mar 18 16:55:27 su8ai01 systemd[1]: ollama.service: Consumed 1min 36.945s CPU time.
Mar 18 16:55:27 su8ai01 systemd[1]: Started ollama.service - Ollama Service.
Mar 18 16:55:27 su8ai01 ollama[265442]: 2025/03/18 16:55:27 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/data/OllamaModels OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Mar 18 16:55:27 su8ai01 ollama[265442]: time=2025-03-18T16:55:27.725+01:00 level=INFO source=images.go:432 msg="total blobs: 25"
Mar 18 16:55:27 su8ai01 ollama[265442]: time=2025-03-18T16:55:27.725+01:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
Mar 18 16:55:27 su8ai01 ollama[265442]: time=2025-03-18T16:55:27.726+01:00 level=INFO source=routes.go:1297 msg="Listening on [::]:11434 (version 0.6.2)"
Mar 18 16:55:27 su8ai01 ollama[265442]: time=2025-03-18T16:55:27.726+01:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Mar 18 16:55:28 su8ai01 ollama[265442]: time=2025-03-18T16:55:28.547+01:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8a4489d7-2dab-940d-5a74-45dee3662627 library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="23.4 GiB"
Mar 18 16:55:39 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:39 | 200 |       48.18µs |       127.0.0.1 | HEAD     "/"
Mar 18 16:55:39 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:39 | 200 |   47.005366ms |       127.0.0.1 | POST     "/api/show"
Mar 18 16:55:39 su8ai01 ollama[265442]: time=2025-03-18T16:55:39.789+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=GPU-8a4489d7-2dab-940d-5a74-45dee3662627 parallel=4 available=25158156288 required="22.4 GiB"
Mar 18 16:55:39 su8ai01 ollama[265442]: time=2025-03-18T16:55:39.988+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="45.1 GiB" free_swap="0 B"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.188+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.4 GiB" memory.required.partial="22.4 GiB" memory.required.kv="3.9 GiB" memory.required.allocations="[22.4 GiB]" memory.weights.total="14.3 GiB" memory.weights.repeating="14.3 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="565.0 MiB" memory.graph.partial="1.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.188+01:00 level=INFO source=server.go:185 msg="enabling flash attention"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.188+01:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.340+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.352+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.355+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 8192 --batch-size 512 --n-gpu-layers 63 --threads 4 --flash-attn --parallel 4 --port 38759"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.371+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.371+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.371+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.390+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.390+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:38759"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.558+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.558+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.558+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.623+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 18 16:55:40 su8ai01 ollama[265442]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 18 16:55:40 su8ai01 ollama[265442]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 18 16:55:40 su8ai01 ollama[265442]: ggml_cuda_init: found 1 CUDA devices:
Mar 18 16:55:40 su8ai01 ollama[265442]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 18 16:55:40 su8ai01 ollama[265442]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 18 16:55:40 su8ai01 ollama[265442]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.689+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.785+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.785+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 18 16:55:41 su8ai01 ollama[265442]: time=2025-03-18T16:55:41.075+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 18 16:55:45 su8ai01 ollama[265442]: time=2025-03-18T16:55:45.985+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.278+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.278+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.278+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.280+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.283+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.492+01:00 level=INFO source=server.go:619 msg="llama runner started in 7.12 seconds"
Mar 18 16:55:47 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:47 | 200 |  8.252148192s |       127.0.0.1 | POST     "/api/generate"
Mar 18 16:55:56 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:56 | 200 |  3.229971972s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:03 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:03 | 200 |  2.951987604s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:15 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:15 | 200 |  3.245265393s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:22 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:22 | 200 |  2.716038903s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:28 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:28 | 200 |  2.794662842s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:34 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:34 | 200 |  134.447763ms |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:38 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:38 | 200 |   98.775391ms |       127.0.0.1 | POST     "/api/chat"
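
In the request log above, the first five `/api/chat` calls take roughly 3 s each, while the last two return in about 0.1 s, still with status 200. That pattern may correspond to the turns that produced no output. Below is a minimal sketch for flagging such requests in a journal dump. It assumes the Gin access-log layout shown above and durations printed with a single `s`, `ms`, or `µs` unit. The names `chat_latencies` and `suspiciously_fast` and the 0.25 ratio are illustrative choices, not anything Ollama provides.

```python
import re
from statistics import median

# Matches lines such as:
# [GIN] 2025/03/18 - 16:56:34 | 200 |  134.447763ms |  127.0.0.1 | POST  "/api/chat"
GIN_LINE = re.compile(
    r'\[GIN\].*?\|\s*(?P<status>\d{3})\s*\|\s*(?P<value>[\d.]+)(?P<unit>[µμu]s|ms|s)\s*\|'
    r'\s*(?P<client>\S+)\s*\|\s*(?P<method>[A-Z]+)\s+"(?P<path>[^"]+)"'
)

# Both the micro sign and the Greek mu are accepted, since either may appear in a copy.
UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "µs": 1e-6, "μs": 1e-6, "us": 1e-6}


def chat_latencies(log_text, path="/api/chat"):
    """Return the latency in seconds of every POST to `path`, in log order."""
    latencies = []
    for match in GIN_LINE.finditer(log_text):
        if match["method"] == "POST" and match["path"] == path:
            latencies.append(float(match["value"]) * UNIT_SECONDS[match["unit"]])
    return latencies


def suspiciously_fast(latencies, ratio=0.25):
    """Return indexes of requests that finished in under `ratio` times the median latency."""
    if not latencies:
        return []
    cutoff = median(latencies) * ratio
    return [index for index, seconds in enumerate(latencies) if seconds < cutoff]
```

A fast 200 response is only a hint that generation ended immediately. Restarting the server with `OLLAMA_DEBUG` enabled (it is `false` in the configuration logged above) should show what the runner did for those requests.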

16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.283+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.492+01:00 level=INFO source=server.go:619 msg="llama runner started in 7.12 seconds" Mar 18 16:55:47 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:47 | 200 | 8.252148192s | 127.0.0.1 | POST "/api/generate" Mar 18 16:55:56 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:56 | 200 | 3.229971972s | 127.0.0.1 | POST "/api/chat" Mar 18 16:56:03 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:03 | 200 | 2.951987604s | 127.0.0.1 | POST "/api/chat" Mar 18 16:56:15 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:15 | 200 | 3.245265393s | 127.0.0.1 | POST "/api/chat" Mar 18 16:56:22 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:22 | 200 | 2.716038903s | 127.0.0.1 | POST "/api/chat" Mar 18 16:56:28 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:28 | 200 | 
2.794662842s | 127.0.0.1 | POST "/api/chat" Mar 18 16:56:34 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:34 | 200 | 134.447763ms | 127.0.0.1 | POST "/api/chat" Mar 18 16:56:38 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:38 | 200 | 98.775391ms | 127.0.0.1 | POST "/api/chat" ```
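In the log above, the last two `/api/chat` requests return in roughly 134 ms and 99 ms, while the earlier ones take about 3 s each. A minimal sketch for spotting such entries in a journal dump follows. It assumes that sub-second `/api/chat` responses coincide with the empty replies, which the thread does not confirm; the function names and the 0.5 s threshold are hypothetical, and Gin durations of a minute or longer (e.g. `1m2s`) are not handled.

```python
import re

# Matches Gin access-log entries such as:
# [GIN] 2025/03/18 - 16:56:34 | 200 | 134.447763ms | 127.0.0.1 | POST "/api/chat"
GIN_LINE = re.compile(
    r'\[GIN\]\s+(?P<date>\S+)\s+-\s+(?P<time>\S+)\s+\|\s+(?P<status>\d{3})\s+\|'
    r'\s+(?P<value>[\d.]+)(?P<unit>[µμ]s|ms|s)\s+\|\s+\S+\s+\|'
    r'\s+(?P<method>\S+)\s+"(?P<path>[^"]+)"'
)

# Conversion factors to seconds; both common encodings of the micro sign are accepted.
UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "µs": 1e-6, "μs": 1e-6}


def parse_gin_line(line):
    """Return a dict for a Gin access-log line, or None if the line is something else."""
    m = GIN_LINE.search(line)
    if m is None:
        return None
    return {
        "time": m.group("time"),
        "status": int(m.group("status")),
        "seconds": float(m.group("value")) * UNIT_SECONDS[m.group("unit")],
        "path": m.group("path"),
    }


def suspiciously_fast(lines, path="/api/chat", threshold=0.5):
    """Collect requests to `path` that completed faster than `threshold` seconds."""
    hits = []
    for line in lines:
        rec = parse_gin_line(line)
        if rec and rec["path"] == path and rec["seconds"] < threshold:
            hits.append(rec)
    return hits
```

Feeding it the output of `journalctl -u ollama` line by line would list the timestamps at which the behaviour changed, which can then be matched against the debug log.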

GiteaMirror commented

2026-05-04 14:14:57 -05:00

@ronaldvdmeer commented on GitHub (Mar 18, 2025):

> [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

Debug logging: https://gist.github.com/ronaldvdmeer/5dab71e495370ee96aa22798b0c79a9e

GiteaMirror commented

2026-05-04 14:14:59 -05:00

@mmb78 commented on GitHub (Mar 19, 2025):

Not sure if this is related, but I'm analyzing a long list of images that I feed to gemma3 models from a Python script via the OpenAI-compatible API, and I have observed several times that after about 100 prompts the analysis time doubles (tested with the 27b and 12b models, q8 and q4).
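One way to put a number on that observation is to compare the median request time at the start of a run with the median at the end. The sketch below assumes the script already records a wall-clock duration per request; the function name and the default window of 20 requests are hypothetical, and no API call is made.

```python
from statistics import median


def slowdown_ratio(durations, window=20):
    """Return median(last `window` durations) / median(first `window` durations)."""
    if window <= 0 or len(durations) < 2 * window:
        raise ValueError("need at least 2 * window measurements")
    baseline = median(durations[:window])
    recent = median(durations[-window:])
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return recent / baseline
```

A ratio near 2 after roughly 100 prompts would match the doubling described above; a ratio near 1 would point to a few slow outliers rather than a sustained slowdown.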

GiteaMirror commented

2026-05-04 14:15:00 -05:00

@ronaldvdmeer commented on GitHub (Mar 19, 2025):

Update: after changing some settings in the systemd unit file, the model keeps running as expected:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/data/OllamaModels"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_KEEP_ALIVE=-1"
#Environment="OLLAMA_NO_CPU_FALLBACK=1"
#Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_DEBUG=1"

[Install]
WantedBy=default.target

I commented out `OLLAMA_NO_CPU_FALLBACK=1` and `OLLAMA_FLASH_ATTENTION=1`.
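When several variables are toggled at once, it helps to list which `OLLAMA_*` settings a unit file actually applies and which are commented out. This is a minimal sketch with a hypothetical helper name; it assumes one assignment per `Environment=` line, as in the unit above, although systemd also allows several assignments on one line.

```python
def ollama_env(unit_text):
    """Split OLLAMA_* Environment= assignments into (active, commented_out) dicts."""
    active, disabled = {}, {}
    for raw in unit_text.splitlines():
        line = raw.strip()
        target = active
        if line.startswith("#"):
            # A commented-out directive: strip the marker and file it separately.
            line = line.lstrip("#").strip()
            target = disabled
        if not line.startswith("Environment="):
            continue
        assignment = line[len("Environment="):].strip().strip('"')
        key, sep, value = assignment.partition("=")
        if sep and key.startswith("OLLAMA_"):
            target[key] = value
    return active, disabled
```

For the unit shown above, the second dict would contain `OLLAMA_NO_CPU_FALLBACK` and `OLLAMA_FLASH_ATTENTION`, the two settings that were disabled between the failing and the working run.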

GiteaMirror commented

2026-05-04 14:15:01 -05:00

@rick-github commented on GitHub (Mar 19, 2025):

OLLAMA_NO_CPU_FALLBACK is not an ollama configuration variable, so FA would seem to be the culprit.

GiteaMirror commented

2026-05-04 14:15:01 -05:00

@ronaldvdmeer commented on GitHub (Mar 19, 2025):

> `OLLAMA_NO_CPU_FALLBACK` is not an ollama configuration variable, so FA would seem to be the culprit.

Indeed. Is FA something that should work with this model?

GiteaMirror commented

2026-05-04 14:15:02 -05:00

@rick-github commented on GitHub (Mar 19, 2025):

Mar 18 17:00:56 su8ai01 ollama[267855]: time=2025-03-18T17:00:56.939+01:00 level=INFO source=server.go:185 msg="enabling flash attention"
Mar 18 17:00:56 su8ai01 ollama[267855]: time=2025-03-18T17:00:56.939+01:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""

Perhaps try setting `OLLAMA_KV_CACHE_TYPE` to one of q4_0, q8_0 or f16. Having said that, gemma3 is having other problems with FA (https://github.com/ollama/ollama/issues/9683), so it might be resolved as part of that issue.
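The two log lines quoted above can be picked out of a longer journal automatically. The following is a minimal sketch, assuming the `key=value` layout seen in this thread's logs; the function names are hypothetical, and quoted values are only stripped of their outer quotes, not unescaped.

```python
import re

# key=value pairs where the value is either a double-quoted string or a bare token.
LOGFMT_PAIR = re.compile(r'([A-Za-z_][\w.]*)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Extract key=value fields from one log line into a dict."""
    fields = {}
    for key, value in LOGFMT_PAIR.findall(line):
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return fields


def flash_attention_findings(lines):
    """Report whether FA was enabled and whether the KV cache warning was logged."""
    findings = {"flash_attention_enabled": False, "kv_cache_warning": False}
    for line in lines:
        msg = parse_logfmt(line).get("msg", "")
        if msg == "enabling flash attention":
            findings["flash_attention_enabled"] = True
        elif msg == "kv cache type not supported by model":
            findings["kv_cache_warning"] = True
    return findings
```

Running it over the logs before and after changing `OLLAMA_FLASH_ATTENTION` or `OLLAMA_KV_CACHE_TYPE` shows whether the setting actually reached the server.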

GiteaMirror commented

2026-05-04 14:15:03 -05:00

@ronaldvdmeer commented on GitHub (Mar 19, 2025):

I thought we had found the problem, but it's still very unstable after a few prompts. Sometimes the CPU goes to 100% and nothing happens. Sometimes the process memory spikes. Not sure. Are you guys able to replicate these stability issues?

GiteaMirror commented

2026-05-04 14:15:03 -05:00

@Ducky6944 commented on GitHub (Mar 20, 2025):

> I thought we had found the problem but it's still very unstable after a few prompts. Sometimes the CPU goes to 100% and nothing happens. Sometimes the process memory spikes. Not sure. Are you guys able to replicate these stability issues?

I am having similar issues using the `gemma3:27b` tag. The last time it happened I had nvtop open, and it showed that memory had gone to 0 and that the process was no longer listed. When I asked it another question, memory and CPU ramped back up but it never generated anything. I'm running in Docker. Not sure if that is related or not.

GiteaMirror commented

2026-05-04 14:15:04 -05:00

@ronaldvdmeer commented on GitHub (Mar 20, 2025):

I've experimented with unloading the model after 30 seconds. This works for a couple of prompts and then this happens:

root@su8ai01:~# ps aux  | grep ollama
ollama    620817 31.7  0.4 6935304 238816 ?      Ssl  09:44   3:29 /usr/local/bin/ollama serve
ronald    620834  0.1  0.0 1927064 31484 pts/5   Sl+  09:44   0:01 ollama run gemma3:27b-it-q4_K_M
ollama    622325 25.7  8.0 69847124 3949532 ?    Sl   09:49   1:31 /usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 8192 --batch-size 512 --n-gpu-layers 63 --threads 4 --parallel 4 --port 35997

Mar 20 09:42:20 su8ai01 systemd[1]: Stopped ollama.service - Ollama Service.
Mar 20 09:42:20 su8ai01 systemd[1]: ollama.service: Consumed 9h 11min 29.121s CPU time.
Mar 20 09:44:12 su8ai01 systemd[1]: Started ollama.service - Ollama Service.
Mar 20 09:44:12 su8ai01 ollama[620817]: 2025/03/20 09:44:12 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAM>
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.543+01:00 level=INFO source=images.go:432 msg="total blobs: 30"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.543+01:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.544+01:00 level=INFO source=routes.go:1297 msg="Listening on [::]:11434 (version 0.6.2)"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.544+01:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.854+01:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8a4489d7-2dab-940d-5a74-45dee3662627 library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" availabl>
Mar 20 09:44:36 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:36 | 200 |       48.14µs |       127.0.0.1 | HEAD     "/"
Mar 20 09:44:36 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:36 | 200 |   46.067577ms |       127.0.0.1 | POST     "/api/show"
Mar 20 09:44:36 su8ai01 ollama[620817]: time=2025-03-20T09:44:36.882+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.139+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.6 GiB" free_swap="0 B"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.140+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.278+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.280+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.283+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.308+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.309+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:39227"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.540+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:44:37 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:44:37 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:44:37 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.565+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.693+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.693+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.991+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:44:41 su8ai01 ollama[620817]: time=2025-03-20T09:44:41.530+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.901+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.902+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.902+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.904+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.907+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:44:43 su8ai01 ollama[620817]: time=2025-03-20T09:44:43.036+01:00 level=INFO source=server.go:619 msg="llama runner started in 5.75 seconds"
Mar 20 09:44:43 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:43 | 200 |   6.57078618s |       127.0.0.1 | POST     "/api/generate"
Mar 20 09:45:02 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:45:02 | 200 |  6.351604303s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.439+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.664+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.5 GiB" free_swap="0 B"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.666+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.770+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.772+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.775+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.781+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.789+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.790+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:42871"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.893+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.893+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.893+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:45:40 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:45:40 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:45:40 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:45:40 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:45:40 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:45:40 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.932+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.032+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.042+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.043+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.486+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:45:44 su8ai01 ollama[620817]: time=2025-03-20T09:45:44.061+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.227+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.227+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.228+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.230+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.233+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.316+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.54 seconds"
Mar 20 09:45:59 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:45:59 | 200 |  19.42089416s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:46:34 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:46:34 | 200 | 15.014740731s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:47:07 su8ai01 ollama[620817]: time=2025-03-20T09:47:07.790+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:47:07 su8ai01 ollama[620817]: time=2025-03-20T09:47:07.999+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.5 GiB" free_swap="0 B"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.000+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.107+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.109+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.112+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.119+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.119+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.119+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.140+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.141+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:41231"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.254+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.254+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.254+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:47:08 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:47:08 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:47:08 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:47:08 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:47:08 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:47:08 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.311+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.370+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.415+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.415+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.821+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:47:11 su8ai01 ollama[620817]: time=2025-03-20T09:47:11.392+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.717+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.717+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.718+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.720+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.723+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.901+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.78 seconds"
Mar 20 09:47:29 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:47:29 | 200 | 21.744458297s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:48:09 su8ai01 ollama[620817]: time=2025-03-20T09:48:09.770+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:48:09 su8ai01 ollama[620817]: time=2025-03-20T09:48:09.996+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.5 GiB" free_swap="0 B"
Mar 20 09:48:09 su8ai01 ollama[620817]: time=2025-03-20T09:48:09.998+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.101+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.104+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.106+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.114+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.133+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.133+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:39309"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.235+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.235+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.235+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:48:10 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:48:10 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:48:10 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:48:10 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:48:10 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:48:10 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.292+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.365+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.416+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.416+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.821+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:48:13 su8ai01 ollama[620817]: time=2025-03-20T09:48:13.338+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.594+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.594+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.594+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.596+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.599+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.604+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.844+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.73 seconds"
Mar 20 09:48:34 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:48:34 | 200 | 25.529366119s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.342+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.551+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.4 GiB" free_swap="0 B"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.553+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.662+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.664+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.666+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.680+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.683+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.684+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:35997"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:49:15 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:49:15 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:49:15 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.846+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.933+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.947+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.947+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:49:16 su8ai01 ollama[620817]: time=2025-03-20T09:49:16.386+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:49:18 su8ai01 ollama[620817]: time=2025-03-20T09:49:18.936+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.059+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.061+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.191+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.52 seconds"
Mar 20 09:49:37 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:49:37 | 200 | 22.737085656s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:50:24 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:50:24 | 200 | 24.209365834s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:51:20 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:51:20 | 200 | 30.315333068s |       127.0.0.1 | POST     "/api/chat"

@ronaldvdmeer commented on GitHub (Mar 20, 2025):

I've experimented with unloading the model after 30 seconds. This works for a couple of prompts and then this happens:

![Image](https://github.com/user-attachments/assets/a2c9de9e-2982-4122-b555-9138b380e4f6)

```
root@su8ai01:~# ps aux | grep ollama
ollama 620817 31.7 0.4 6935304 238816 ? Ssl 09:44 3:29 /usr/local/bin/ollama serve
ronald 620834 0.1 0.0 1927064 31484 pts/5 Sl+ 09:44 0:01 ollama run gemma3:27b-it-q4_K_M
ollama 622325 25.7 8.0 69847124 3949532 ? Sl 09:49 1:31 /usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 8192 --batch-size 512 --n-gpu-layers 63 --threads 4 --parallel 4 --port 35997
```

```
Mar 20 09:42:20 su8ai01 systemd[1]: Stopped ollama.service - Ollama Service.
Mar 20 09:42:20 su8ai01 systemd[1]: ollama.service: Consumed 9h 11min 29.121s CPU time.
Mar 20 09:44:12 su8ai01 systemd[1]: Started ollama.service - Ollama Service.
Mar 20 09:44:12 su8ai01 ollama[620817]: 2025/03/20 09:44:12 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAM>
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.543+01:00 level=INFO source=images.go:432 msg="total blobs: 30"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.543+01:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.544+01:00 level=INFO source=routes.go:1297 msg="Listening on [::]:11434 (version 0.6.2)"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.544+01:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.854+01:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8a4489d7-2dab-940d-5a74-45dee3662627 library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" availabl>
Mar 20 09:44:36 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:36 | 200 | 48.14µs | 127.0.0.1 | HEAD "/"
Mar 20 09:44:36 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:36 | 200 | 46.067577ms | 127.0.0.1 | POST "/api/show"
Mar 20 09:44:36 su8ai01 ollama[620817]: time=2025-03-20T09:44:36.882+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.139+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.6 GiB" free_swap="0 B"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.140+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.278+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.280+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.283+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.308+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.309+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:39227"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.540+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:44:37 su8ai01 ollama[620817]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:44:37 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:44:37 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.565+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.693+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.693+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.991+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:44:41 su8ai01 ollama[620817]: time=2025-03-20T09:44:41.530+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.901+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.902+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.902+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.904+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.907+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:44:43 su8ai01 ollama[620817]: time=2025-03-20T09:44:43.036+01:00 level=INFO source=server.go:619 msg="llama runner started in 5.75 seconds"
Mar 20 09:44:43 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:43 | 200 | 6.57078618s | 127.0.0.1 | POST "/api/generate"
Mar 20 09:45:02 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:45:02 | 200 | 6.351604303s | 127.0.0.1 | POST "/api/chat"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.439+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.664+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.5 GiB" free_swap="0 B"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.666+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.770+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.772+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.775+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.781+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.789+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.790+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:42871"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.893+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.893+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.893+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:45:40 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Mar 20 09:45:40 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:45:40 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:45:40 su8ai01 ollama[620817]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:45:40 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:45:40 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:45:40 su8ai01 
ollama[620817]: time=2025-03-20T09:45:40.932+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9> Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.032+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.042+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB" Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.043+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB" Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.486+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding" Mar 20 09:45:44 su8ai01 ollama[620817]: time=2025-03-20T09:45:44.061+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.227+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.227+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.228+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.230+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.233+01:00 level=WARN source=ggml.go:149 
msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.316+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.54 seconds" Mar 20 09:45:59 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:45:59 | 200 | 19.42089416s | 127.0.0.1 | POST "/api/chat" Mar 20 09:46:34 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:46:34 | 200 | 15.014740731s | 127.0.0.1 | POST "/api/chat" Mar 20 09:47:07 su8ai01 ollama[620817]: time=2025-03-20T09:47:07.790+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=> Mar 20 09:47:07 su8ai01 ollama[620817]: time=2025-03-20T09:47:07.999+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.5 GiB" free_swap="0 B" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.000+01:00 level=INFO 
source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=> Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.107+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.109+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.112+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model 
/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6> Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.119+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.119+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.119+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.140+01:00 level=INFO source=runner.go:763 msg="starting ollama engine" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.141+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:41231" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.254+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.254+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.254+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36 Mar 20 09:47:08 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Mar 20 09:47:08 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Mar 20 09:47:08 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices: Mar 20 09:47:08 su8ai01 ollama[620817]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes Mar 20 09:47:08 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so Mar 20 09:47:08 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so Mar 20 09:47:08 su8ai01 
ollama[620817]: time=2025-03-20T09:47:08.311+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9> Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.370+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.415+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.415+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB" Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.821+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding" Mar 20 09:47:11 su8ai01 ollama[620817]: time=2025-03-20T09:47:11.392+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.717+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.717+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.718+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.720+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.723+01:00 level=WARN source=ggml.go:149 
msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.901+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.78 seconds" Mar 20 09:47:29 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:47:29 | 200 | 21.744458297s | 127.0.0.1 | POST "/api/chat" Mar 20 09:48:09 su8ai01 ollama[620817]: time=2025-03-20T09:48:09.770+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=> Mar 20 09:48:09 su8ai01 ollama[620817]: time=2025-03-20T09:48:09.996+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.5 GiB" free_swap="0 B" Mar 20 09:48:09 su8ai01 ollama[620817]: time=2025-03-20T09:48:09.998+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" 
memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=> Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.101+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.104+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.106+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6> Mar 20 09:48:10 su8ai01 
ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding" Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.114+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error" Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.133+01:00 level=INFO source=runner.go:763 msg="starting ollama engine" Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.133+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:39309" Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.235+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.235+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.235+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36 Mar 20 09:48:10 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Mar 20 09:48:10 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Mar 20 09:48:10 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices: Mar 20 09:48:10 su8ai01 ollama[620817]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes Mar 20 09:48:10 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so Mar 20 09:48:10 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.292+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 
CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9> Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.365+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.416+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB" Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.416+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB" Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.821+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding" Mar 20 09:48:13 su8ai01 ollama[620817]: time=2025-03-20T09:48:13.338+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.594+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.594+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.594+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.596+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.599+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer 
default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.604+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.844+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.73 seconds" Mar 20 09:48:34 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:48:34 | 200 | 25.529366119s | 127.0.0.1 | POST "/api/chat" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.342+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=> Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.551+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.4 GiB" free_swap="0 B" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.553+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" 
memory.required.full=> Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.662+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.664+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.666+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6> Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO 
source=sched.go:450 msg="loaded runners" count=1 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.680+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.683+01:00 level=INFO source=runner.go:763 msg="starting ollama engine" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.684+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:35997" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36 Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices: Mar 20 09:49:15 su8ai01 ollama[620817]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes Mar 20 09:49:15 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so Mar 20 09:49:15 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.846+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 
CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9> Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.933+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.947+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.947+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB" Mar 20 09:49:16 su8ai01 ollama[620817]: time=2025-03-20T09:49:16.386+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding" Mar 20 09:49:18 su8ai01 ollama[620817]: time=2025-03-20T09:49:18.936+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.059+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.061+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:49:20 
su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.191+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.52 seconds" Mar 20 09:49:37 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:49:37 | 200 | 22.737085656s | 127.0.0.1 | POST "/api/chat" Mar 20 09:50:24 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:50:24 | 200 | 24.209365834s | 127.0.0.1 | POST "/api/chat" Mar 20 09:51:20 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:51:20 | 200 | 30.315333068s | 127.0.0.1 | POST "/api/chat" ```

GiteaMirror commented

2026-05-04 14:15:04 -05:00

@ronaldvdmeer commented on GitHub (Mar 26, 2025):

Am I the only one?


GiteaMirror commented

2026-05-04 14:15:04 -05:00

@rick-github commented on GitHub (Mar 26, 2025):

It seems so.

It might help if we can pinpoint where in the call chain the stalling happens. The server and the runner communicate via the port number specified by --port. If you start a listener on that port:

tcpflow -c -i lo port $PORT

and use the client to send requests, does traffic on the port stop when the model stops responding?
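To fill in $PORT, one option is to read it from the runner's "Server listening on 127.0.0.1:..." entries in the log output earlier in this issue. The sketch below is only an illustration: `latest_runner_port` is a hypothetical helper, not part of Ollama, and it assumes the most recent "Server listening" entry belongs to the runner that is currently active (the logs show a new port after each runner start).

```python
import re

# Matches the runner's startup line as it appears in the logs in this thread, e.g.
#   source=runner.go:823 msg="Server listening on 127.0.0.1:42871"
LISTEN_RE = re.compile(r'msg="Server listening on 127\.0\.0\.1:(\d+)"')


def latest_runner_port(log_text):
    """Return the port from the last "Server listening" entry, or None.

    Hypothetical helper: assumes the newest entry is the active runner.
    """
    ports = LISTEN_RE.findall(log_text)
    return int(ports[-1]) if ports else None


# Sample entries copied from the log output in this issue.
sample = (
    'level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:42871"\n'
    'level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:41231"\n'
)

port = latest_runner_port(sample)
if port is not None:
    # The same tcpflow invocation suggested above, with $PORT filled in.
    print(f"tcpflow -c -i lo port {port}")
```

In practice the log text would come from the service journal rather than a hard-coded sample. Because the runner may be restarted with a different port, the capture may need to be restarted whenever a new "Server listening" entry appears.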


GiteaMirror commented

2026-05-04 14:15:05 -05:00

@Ducky6944 commented on GitHub (Mar 26, 2025):

What's the output of nvidia-smi -q? In my case, the issue seems to have been caused by using an "unlicensed" vGPU. I haven't run into it since resolving the licensing issue, even though the GPU was reporting as unrestricted, which I took at face value.


@ronaldvdmeer commented on GitHub (Mar 26, 2025):

> It seems so.
>
> It might help if we can pinpoint where in the call chain the stalling happens. The server and the runner communicate via the port number specified by --port. If you start a listener on that port:
>
> tcpflow -c -i lo port $PORT
>
> and use the client to send requests, does traffic on the port stop when the model stops responding?

https://pastebin.com/raw/4FyGnHuQ
With the last 3 or 4 prompts there was no response from the model anymore. Traffic pattern changed and was very limited. Not sure what that means.

> nvidia-smi -q

Here is the output. https://pastebin.com/raw/hNPC2dvT


@ronaldvdmeer commented on GitHub (Mar 26, 2025):

After extensive testing, I found that the GPU instability I was experiencing when running Ollama inside a Proxmox VM with passthrough was likely caused by a virtual IOMMU (vIOMMU) being presented to the VM.

Even though I hadn’t explicitly enabled IOMMU support inside the guest OS, the Proxmox configuration was set to expose a virtual IOMMU. The Linux kernel inside the VM automatically detected and enabled it, which I believe led to unexpected behavior with CUDA workloads.

Once I disabled the virtual IOMMU in the VM configuration, all instability disappeared. The system has been completely stable since, and performance is consistent.

Does this seem logical?
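One way to check from inside the guest whether the kernel has picked up an IOMMU is to look for IOMMU groups in sysfs. A minimal Python sketch, assuming a Linux guest that exposes /sys/kernel/iommu_groups; the function name is hypothetical, and an empty result only suggests, rather than proves, that no IOMMU is active:

```python
from pathlib import Path
from typing import List


def list_iommu_groups(root: str = "/sys/kernel/iommu_groups") -> List[str]:
    """Return IOMMU group names found under root, or [] if none exist."""
    base = Path(root)
    if not base.is_dir():
        return []
    names = [entry.name for entry in base.iterdir() if entry.is_dir()]
    # Group names are numeric; sorting by length, then text, orders them numerically.
    return sorted(names, key=lambda name: (len(name), name))


groups = list_iommu_groups()
print(f"{len(groups)} IOMMU group(s) visible to this kernel")
```

Running it before and after changing the VM configuration gives a quick indication of whether the guest still sees an IOMMU.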


@rick-github commented on GitHub (Mar 26, 2025):

Sounds logical. There are other reports of IOMMU settings causing issues, but usually by generating random tokens, not getting wedged. Let it bake for a bit and if you feel the problem is resolved, close the ticket, and feel free to re-open if the problem re-occurs.


@Johnno1011 commented on GitHub (Mar 27, 2025):

Hey all.
I think this issue is the most closely related to what I'm experiencing with gemma3:27b.

When running the model entirely on CPU, I'm unable to get any response at all. If I wait long enough in the UI I get tokens, but it's EXTREMELY slow. My machine also goes flat out for seemingly little output. See attached screenshot. I have tried turning off quantisation, flash attention, etc., but this has no effect.
Cheers.

Screenshot: https://github.com/user-attachments/assets/7e1e82a8-c1e0-4fd0-a1ba-38ca70e5dbe6

@rick-github commented on GitHub (Mar 27, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

@Johnno1011 commented on GitHub (Mar 27, 2025):

Server startup logs:

time=2025-03-27T15:04:48.367Z level=INFO source=server.go:105 msg="system memory" total="125.8 GiB" free="95.4 GiB" free_swap="0 B"
time=2025-03-27T15:04:48.369Z level=INFO source=server.go:138 msg=offload library=cpu layers.requested=-1 layers.model=63 layers.offload=0 layers.split="" memory.available="[95.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.9 GiB" memory.required.partial="0 B" memory.required.kv="992.0 MiB" memory.required.allocations="[18.9 GiB]" memory.weights.total="14.3 GiB" memory.weights.repeating="14.3 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="522.5 MiB" memory.graph.partial="1.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-03-27T15:04:48.369Z level=WARN source=server.go:196 msg="quantized kv cache requested but flash attention disabled" type=f32
time=2025-03-27T15:04:48.467Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-27T15:04:48.471Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-27T15:04:48.474Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-27T15:04:48.482Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /mnt/dsdrive/blobs/sha256-e796792eba26c4d3b04b0ac5adb01a453dd9ec2dfd83b6c59cbf6fe5f30b0f68 --ctx-size 2048 --batch-size 512 --threads 64 --no-mmap --parallel 1 --port 40519"
time=2025-03-27T15:04:48.482Z level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-27T15:04:48.482Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-03-27T15:04:48.483Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-03-27T15:04:48.496Z level=INFO source=runner.go:763 msg="starting ollama engine"
time=2025-03-27T15:04:48.496Z level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:40519"
time=2025-03-27T15:04:48.595Z level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-03-27T15:04:48.595Z level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-03-27T15:04:48.595Z level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=37
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-03-27T15:04:48.600Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-03-27T15:04:48.605Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="17.3 GiB"
time=2025-03-27T15:04:48.750Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
time=2025-03-27T15:04:54.129Z level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CPU
time=2025-03-27T15:04:54.129Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-27T15:04:54.133Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-27T15:04:54.137Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-27T15:04:54.257Z level=INFO source=server.go:619 msg="llama runner started in 5.77 seconds"


@Johnno1011 commented on GitHub (Mar 27, 2025):

Sorry just to add to this, quick rundown is I'm running on CPU only here and have found the model is way way slower than I would expect. Putting the model on the GPU did sort the issue out as of course it's quicker. Anyway, feel free to discard my points as the CPU speed issue is something unrelated to this specific ticket :)


@rick-github commented on GitHub (Mar 27, 2025):

When running on CPU, the constraint for large models is memory bandwidth. gemma3:27b wants about 19 GB of RAM. My test system has 5600 MHz DDR5 RAM, with a memory bandwidth of about 69 GB/s. That means the model can generate at most 69/19 tps, or about 3.6 tps. The output of ollama run --verbose shows 3.42 tps, so close to the max. If you are seeing slow inference on the CPU, your memory bandwidth is probably lower. I see that your CPU is Haswell architecture; from a quick search, it looks like DDR3 and DDR4 memory subsystems were common for Haswell, with a max memory bandwidth of about 25 GB/s with 3200 MHz DDR4.

Everything else in the log looks normal, so I think this is just a case of hardware limitations.
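The estimate above follows from the fact that a dense model has to stream all of its weights from RAM for every generated token, so tokens per second cannot exceed bandwidth divided by model size. A rough Python sketch of that arithmetic; the function names are illustrative, and the per-channel figure assumes a standard 64-bit (8-byte) DDR channel:

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed when memory bandwidth is the limit."""
    return bandwidth_gb_s / model_size_gb


def channel_bandwidth_gb_s(mega_transfers_per_s: float, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth of one DDR channel, in GB/s."""
    return mega_transfers_per_s * bytes_per_transfer / 1000


print(round(max_tokens_per_second(69, 19), 2))  # figures quoted in the comment above
print(channel_bandwidth_gb_s(3200))             # a single DDR4-3200 channel
```

With 69 GB/s and 19 GB this gives roughly 3.6 tps, and a single DDR4-3200 channel works out to 25.6 GB/s. Real throughput will sit below the bound, and systems with several memory channels multiply the per-channel figure.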


@Johnno1011 commented on GitHub (Mar 27, 2025):

> When running on CPU, the constraint for large models is memory bandwidth. gemma3:27b wants about 19 GB of RAM. My test system has 5600 MHz DDR5 RAM, with a memory bandwidth of about 69 GB/s. That means the model can generate at most 69/19 tps, or about 3.6 tps. The output of ollama run --verbose shows 3.42 tps, so close to the max. If you are seeing slow inference on the CPU, your memory bandwidth is probably lower. I see that your CPU is Haswell architecture; from a quick search, it looks like DDR3 and DDR4 memory subsystems were common for Haswell, with a max memory bandwidth of about 25 GB/s with 3200 MHz DDR4.
>
> Everything else in the log looks normal, so I think this is just a case of hardware limitations.

This is really useful information, thank you!
Based on my system, I have approx. 24 GB/s of DDR4 bandwidth. 24/19 = 1.26 tps. I am not getting this. I am running on a 64-core CPU machine. Any thoughts? Thanks


@rick-github commented on GitHub (Mar 27, 2025):

> Any thoughts?

Possibly thread contention or cache thrashing. What happens if you reduce the number of threads? Try 1 and 31; if there's a change in performance, do a binary search to find the best value:

$ ollama run gemma3:27b --verbose
>>> hello
Hello there! 👋 

How can I help you today? Just let me know what you're thinking, or if you just wanted to say hi, that's great too! 

I can:

* **Answer questions:** About pretty much anything!
* **Generate creative content:** Like stories, poems, code, scripts, musical pieces, email, letters, etc.
* **Translate languages.**
* **Summarize text.**
* **Help with brainstorming.**
* **Just chat!**


total duration:       33.118166262s
load duration:        54.58724ms
prompt eval count:    10 token(s)
prompt eval duration: 1.410197875s
prompt eval rate:     7.09 tokens/s
eval count:           108 token(s)
eval duration:        31.652382454s
eval rate:            3.41 tokens/s

>>> /set parameter num_thread 1
Set parameter 'num_thread' to '1'
>>> say hello, concisely
Hello! 👋


total duration:       1m46.132958402s
load duration:        3.663317777s
prompt eval count:    131 token(s)
prompt eval duration: 1m38.085952912s
prompt eval rate:     1.34 tokens/s
eval count:           5 token(s)
eval duration:        4.365867083s
eval rate:            1.15 tokens/s

>>> /set parameter num_thread 31
Set parameter 'num_thread' to '31'
>>> say hello, concisely
Hi! 😊


total duration:       2m9.401733529s
load duration:        3.883104395s
prompt eval count:    149 token(s)
prompt eval duration: 1m2.323025105s
prompt eval rate:     2.39 tokens/s
eval count:           5 token(s)
eval duration:        1m3.166008742s
eval rate:            0.08 tokens/s
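
The eval rate in each run above is simply eval count divided by eval duration, which makes it easy to compare thread settings from saved output. A small Python sketch, assuming durations are printed in the Go style seen above (for example 1m3.166008742s or 54.58724ms); the helper names are made up for illustration:

```python
import re

# Seconds per unit for Go-style duration strings such as "1m3.166008742s".
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# Two-letter units are listed first so "ms" is not read as "m" followed by "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|us|µs|ns|h|m|s)")


def parse_duration(text: str) -> float:
    """Convert a duration like '1m46.132958402s' to seconds."""
    parts = _PART.findall(text)
    if not parts:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def eval_rate(eval_count: int, eval_duration: str) -> float:
    """Tokens generated per second of eval time."""
    return eval_count / parse_duration(eval_duration)


# Eval counts and durations copied from the three runs above.
for threads, count, duration in [("default", 108, "31.652382454s"),
                                 ("1", 5, "4.365867083s"),
                                 ("31", 5, "1m3.166008742s")]:
    print(f"num_thread={threads}: {eval_rate(count, duration):.2f} tokens/s")
```

This reproduces the 3.41, 1.15 and 0.08 tokens/s reported above, so the same helper can be pointed at results for other num_thread values while narrowing down the best setting.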


@ronaldvdmeer commented on GitHub (Mar 27, 2025):

> Sounds logical. There are other reports of IOMMU settings causing issues, but usually by generating random tokens, not getting wedged. Let it bake for a bit and if you feel the problem is resolved, close the ticket, and feel free to re-open if the problem re-occurs.

Just wanted to confirm that after another full day of heavy use, the system remains completely stable. I haven’t had to reboot even once — which is a first since I started using GPU passthrough with Ollama. Disabling the virtual IOMMU in the VM configuration clearly solved it for me.

I realize the thread now includes a separate issue related to CPU behavior, which may be unrelated to this specific passthrough problem. Hopefully, this update still helps others who run into similar stability problems with NVIDIA GPUs and virtual IOMMUs in Proxmox environments.


@ronaldvdmeer commented on GitHub (Mar 27, 2025):

Thanks everyone — I’m going to close this issue since the original GPU passthrough instability is now fully resolved.
For those experiencing unrelated issues (like CPU behavior), it’s probably best to open a separate ticket to keep things focused.


@rick-github commented on GitHub (Mar 27, 2025):

Thanks @ronaldvdmeer, these corner cases are always a bugbear and I'm glad it's been resolved.


parth/log-probs

mxyng/next-mlx

mxyng/cmd-history

parth/templating

parth/tokenize-detokenize

brucemacd/check-key-register

bmizerany/grammar

jmorganca/vendor-081b29bd

mxyng/func-checks

jmorganca/fix-null-format

parth/fix-default-to-warn-json

jmorganca/qwen2vl

jmorganca/no-concat

parth/cmd-cleanup-SO

brucemacd/check-key-register-structured-err

parth/openai-stream-usage

parth/fix-referencing-so

stream-tools-stop

jmorganca/degin-1

brucemacd/install-path-clean

brucemacd/push-name-validation

brucemacd/browser-key-register

jmorganca/openai-fix-first-message

jmorganca/fix-proxy

jessegross/sample

parth/disallow-streaming-tools

dhiltgen/remove_submodule

jmorganca/ga

jmorganca/mllama

pdevine/newlines

pdevine/geems-2b

jmorganca/llama-bump

mxyng/modelname-7

mxyng/gin-slog

mxyng/modelname-6

jyan/convert-prog

jyan/quant5

paligemma-support

pdevine/import-docs

jmorganca/openai-context

jyan/paligemma

jyan/p2

jyan/palitest

bmizerany/embedspeedup

jmorganca/llama-vit

brucemacd/allow-ollama

royh/ep-methods

royh/whisper

mxyng/api-models

mxyng/fix-memory

jyan/q4_4/8

jyan/ollama-v

royh/stream-tools

roy-embed-parallel

bmizerany/hrm

revert-5963-revert-5924-mxyng/llama3.1-rope

royh/embed-viz

jyan/local2

jyan/auth

jyan/local

jyan/parse-temp

jmorganca/template-mistral

jyan/reord-g

royh-openai-suffixdocs

royh-imgembed

royh-embed-parallel

jyan/quant4

royh-precision

jyan/progress

pdevine/fix-template

jyan/quant3

pdevine/ggla

mxyng/update-registry-domain

jmorganca/ggml-static

mxyng/create-context

jyan/v0.146

mxyng/layers-from-files

build_dist

bmizerany/noseek

royh-ls

royh-name

timeout

mxyng/server-timestamp

bmizerany/nosillyggufslurps

royh-params

jmorganca/llama-cpp-7c26775

royh-openai-delete

royh-show-rigid

jmorganca/enable-fa

jmorganca/no-error-template

jyan/format

royh-testdelete

bmizerany/fastverify

language_support

pdevine/ps-glitches

brucemacd/tokenize

bruce/iq-quants

bmizerany/filepathwithcoloninhost

mxyng/split-bin

bmizerany/client-registry

jmorganca/if-none-match

native

jmorganca/native

jmorganca/batch-embeddings

jmorganca/initcmake

jmorganca/mm

pdevine/showggmlinfo

modenameenforcealphanum

bmizerany/modenameenforcealphanum

jmorganca/done-reason

jmorganca/llama-cpp-8960fe8

ollama.com

bmizerany/filepathnobuild

bmizerany/types/model/defaultfix

rmdisplaylong

nogogen

bmizerany/x

modelfile-readme

bmizerany/replacecolon

jmorganca/limit

jmorganca/execstack

jmorganca/replace-assets

mxyng/tune-concurrency

jmorganca/testing

whitespace-detection

jmorganca/options

upgrade-all

scratch

cuda-search

mattw/airenamer

mattw/allmodelsonhuggingface

mattw/quantcontext

mattw/whatneedstorun

brucemacd/llama-mem-calc

mattw/faq-context

mattw/communitylinks

mattw/noprune

mattw/python-functioncalling

rename

mxyng/install

pulse

remove-first

editor

mattw/selfqueryingretrieval

cgo

mattw/howtoquant

api

matt/streamingapi

format-config

mxyng/extra-args

shell

update-nous-hermes

cp-model

upload-progress

fix-unknown-model

fix-model-names

delete-fix

insecure-registry

ls

deletemodels

progressbar

readme-updates

license-layers

skip-list

list-models

modelpath

matt/examplemodelfiles

distribution

go-opts

Reference: github-starred/ollama#68511