[GH-ISSUE #9857] Ollama v0.6.2 - Gemma3 Model Stops Responding After a Few Prompts #52966

Closed
opened 2026-04-29 01:29:17 -05:00 by GiteaMirror · 27 comments

Originally created by @ronaldvdmeer on GitHub (Mar 18, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9857

What is the issue?

When using Ollama v0.6.2 with the model gemma3:27b-it-q4_K_M, the model stops responding after a few interactions. There is no error message, but it no longer generates any output after receiving a prompt.

Environment
• Ollama version: v0.6.2
• Model: gemma3:27b-it-q4_K_M
• OS: Debian 12
• Hardware:
• GPU: NVIDIA RTX 3090
• RAM: 48GB

Steps to Reproduce
Start Ollama and load the model:

ollama run gemma3:27b-it-q4_K_M

Ask a few questions, for example:

>>> hello there, how are you doing today
>>> good to hear my friend. can you tell me something interesting about March 18th
>>> yes sure
>>> what was the name of the spacecraft

After a few responses, the model suddenly stops responding, with no error message displayed.
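For anyone trying to reproduce this outside the CLI, below is a minimal sketch that replays the same conversation through the /api/chat endpoint (assumptions: server on the default localhost:11434, curl and jq available, 120 s counts as a hang):

#!/usr/bin/env bash
# Replays the CLI conversation against /api/chat. A hang shows up as a
# request that never returns; `timeout` flags it after 120 seconds.
MODEL=gemma3:27b-it-q4_K_M
HISTORY='[]'
for PROMPT in \
  "hello there, how are you doing today" \
  "good to hear my friend. can you tell me something interesting about March 18th" \
  "yes sure" \
  "what was the name of the spacecraft"
do
  HISTORY=$(jq -c --arg p "$PROMPT" '. + [{role:"user",content:$p}]' <<<"$HISTORY")
  RESP=$(timeout 120 curl -s http://localhost:11434/api/chat \
    -d "{\"model\":\"$MODEL\",\"messages\":$HISTORY,\"stream\":false}") \
    || { echo "request hung or failed on: $PROMPT"; exit 1; }
  REPLY=$(jq -r '.message.content' <<<"$RESP")
  HISTORY=$(jq -c --arg r "$REPLY" '. + [{role:"assistant",content:$r}]' <<<"$HISTORY")
  echo "--- got ${#REPLY} chars for: $PROMPT"
done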

Expected Behavior
The model should continue generating responses without interruption.

Actual Outcome
After a few successful interactions, the model becomes unresponsive without any error message.

Additional Information
• There is no consistent pattern as to when the issue occurs.
• No errors are logged in the console.
• CPU and GPU usage remain within normal levels.
• Restarting ollama temporarily restores functionality.

Any guidance on resolving this issue or debugging further would be greatly appreciated.

Relevant log output

root@su8ai01:~# ollama show gemma3:27b-it-q4_K_M
  Model
    architecture        gemma3    
    parameters          27.4B     
    context length      8192      
    embedding length    5376      
    quantization        Q4_K_M    

  Parameters
    stop           "<end_of_turn>"    
    temperature    0.1                

  License
    Gemma Terms of Use                  
    Last modified: February 21, 2024    

root@su8ai01:~# ollama run gemma3:27b-it-q4_K_M
>>> hello there, how are you doing today
Hello! As an AI, I don't *experience* feelings like "doing well," but I'm functioning perfectly and ready 
to help! So you could say I'm doing great! 😄 

How about *you*? How are *you* doing today? I hope you're having a good one so far. 

Is there anything I can help you with?





>>> good to hear my friend. can you tell me something interesting about march 18th
You're kind! 😊 Okay, here's something interesting about March 18th:

**On March 18th, 1937, the first blood bank opened in Chicago!** 

It was established by Dr. Bernard Fantus, who pioneered the concept of storing blood for future 
transfusions. Before this, transfusions were often done directly from donor to recipient, which was risky 
and time-consuming. Dr. Fantus realized the need for a readily available supply of blood, and his work 
revolutionized medical care.

Pretty cool, right? It's a day that significantly impacted the field of medicine and saved countless lives!

Would you like to know another interesting fact about March 18th, or perhaps a fact about a different date?





>>> yes sure
Alright! Here's another interesting fact about March 18th:

**On March 18th, 1965, Alexei Leonov, a Soviet cosmonaut, performed the first spacewalk!**

He exited the Voskhod 2 spacecraft and spent 12 minutes and 9 seconds outside the vehicle, tethered by a 
5.35-meter (17.6 ft) umbilical cord. It was a huge moment in the Space Race and a significant achievement 
in human space exploration. He faced some challenges - his suit inflated, making it difficult to re-enter 
the airlock - but he managed it successfully!

Pretty daring, huh? 🚀





>>> what was the name of the spacecraft
The spacecraft Alexei Leonov spacewalked from was called **Voskhod 2** (Восход-2 in Russian).

It was a modified version of the Vostok spacecraft, and it was specifically adapted to allow for a 
spacewalk. It carried a crew of two: Pavel Belyayev (commander) and Alexei Leonov (pilot/spacewalker).

It's interesting

>>> very interesting


>>> can you tell me something more about today


>>> hello?


>>> /bye
root@su8ai01:~# ollama ps
NAME                    ID              SIZE     PROCESSOR    UNTIL   
gemma3:27b-it-q4_K_M    30ddded7fba6    24 GB    100% GPU     Forever    

root@su8ai01:~# topshort
top - 16:07:45 up 22:38,  8 users,  load average: 0.01, 0.13, 0.71
Tasks: 188 total,   1 running, 187 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 50.0 sy,  0.0 ni, 50.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
MiB Mem :  48176.7 total,  21945.4 free,   5502.9 used,  21675.4 buff/cache     
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  42673.8 avail Mem 

root@su8ai01:~# nvidia-smi
Tue Mar 18 16:06:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:06:10.0 Off |                  N/A |
|  0%   36C    P8             17W /  350W |   20968MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    255476      C   /usr/local/bin/ollama                       20962MiB |
+-----------------------------------------------------------------------------------------+

OS

Linux

GPU

Nvidia

CPU

Other

Ollama version

ollama version is 0.6.2

GiteaMirror added the bug label 2026-04-29 01:29:17 -05:00

@rick-github commented on GitHub (Mar 18, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.
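On a systemd install like this one, that boils down to something like the following (per the linked troubleshooting doc; the override contents are just a sketch):

# Follow the service log while reproducing the hang
journalctl -u ollama -f --no-pager

# Optional: enable debug logging, then restart and retry
sudo systemctl edit ollama    # add:  [Service]
                              #       Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama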


@ronaldvdmeer commented on GitHub (Mar 18, 2025):

> Server logs will aid in debugging.

Mar 18 16:55:27 su8ai01 systemd[1]: Stopped ollama.service - Ollama Service.
Mar 18 16:55:27 su8ai01 systemd[1]: ollama.service: Consumed 1min 36.945s CPU time.
Mar 18 16:55:27 su8ai01 systemd[1]: Started ollama.service - Ollama Service.
Mar 18 16:55:27 su8ai01 ollama[265442]: 2025/03/18 16:55:27 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/data/OllamaModels OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Mar 18 16:55:27 su8ai01 ollama[265442]: time=2025-03-18T16:55:27.725+01:00 level=INFO source=images.go:432 msg="total blobs: 25"
Mar 18 16:55:27 su8ai01 ollama[265442]: time=2025-03-18T16:55:27.725+01:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
Mar 18 16:55:27 su8ai01 ollama[265442]: time=2025-03-18T16:55:27.726+01:00 level=INFO source=routes.go:1297 msg="Listening on [::]:11434 (version 0.6.2)"
Mar 18 16:55:27 su8ai01 ollama[265442]: time=2025-03-18T16:55:27.726+01:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Mar 18 16:55:28 su8ai01 ollama[265442]: time=2025-03-18T16:55:28.547+01:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8a4489d7-2dab-940d-5a74-45dee3662627 library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="23.4 GiB"
Mar 18 16:55:39 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:39 | 200 |       48.18µs |       127.0.0.1 | HEAD     "/"
Mar 18 16:55:39 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:39 | 200 |   47.005366ms |       127.0.0.1 | POST     "/api/show"
Mar 18 16:55:39 su8ai01 ollama[265442]: time=2025-03-18T16:55:39.789+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=GPU-8a4489d7-2dab-940d-5a74-45dee3662627 parallel=4 available=25158156288 required="22.4 GiB"
Mar 18 16:55:39 su8ai01 ollama[265442]: time=2025-03-18T16:55:39.988+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="45.1 GiB" free_swap="0 B"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.188+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.4 GiB" memory.required.partial="22.4 GiB" memory.required.kv="3.9 GiB" memory.required.allocations="[22.4 GiB]" memory.weights.total="14.3 GiB" memory.weights.repeating="14.3 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="565.0 MiB" memory.graph.partial="1.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.188+01:00 level=INFO source=server.go:185 msg="enabling flash attention"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.188+01:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.340+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.352+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.355+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.370+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 8192 --batch-size 512 --n-gpu-layers 63 --threads 4 --flash-attn --parallel 4 --port 38759"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.371+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.371+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.371+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.390+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.390+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:38759"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.558+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.558+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.558+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.623+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 18 16:55:40 su8ai01 ollama[265442]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 18 16:55:40 su8ai01 ollama[265442]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 18 16:55:40 su8ai01 ollama[265442]: ggml_cuda_init: found 1 CUDA devices:
Mar 18 16:55:40 su8ai01 ollama[265442]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 18 16:55:40 su8ai01 ollama[265442]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 18 16:55:40 su8ai01 ollama[265442]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.689+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.785+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 18 16:55:40 su8ai01 ollama[265442]: time=2025-03-18T16:55:40.785+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 18 16:55:41 su8ai01 ollama[265442]: time=2025-03-18T16:55:41.075+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 18 16:55:45 su8ai01 ollama[265442]: time=2025-03-18T16:55:45.985+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.278+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.278+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.278+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.280+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.283+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.289+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 18 16:55:47 su8ai01 ollama[265442]: time=2025-03-18T16:55:47.492+01:00 level=INFO source=server.go:619 msg="llama runner started in 7.12 seconds"
Mar 18 16:55:47 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:47 | 200 |  8.252148192s |       127.0.0.1 | POST     "/api/generate"
Mar 18 16:55:56 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:55:56 | 200 |  3.229971972s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:03 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:03 | 200 |  2.951987604s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:15 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:15 | 200 |  3.245265393s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:22 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:22 | 200 |  2.716038903s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:28 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:28 | 200 |  2.794662842s |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:34 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:34 | 200 |  134.447763ms |       127.0.0.1 | POST     "/api/chat"
Mar 18 16:56:38 su8ai01 ollama[265442]: [GIN] 2025/03/18 - 16:56:38 | 200 |   98.775391ms |       127.0.0.1 | POST     "/api/chat"

@ronaldvdmeer commented on GitHub (Mar 18, 2025):

> Server logs will aid in debugging.

Debug logging: https://gist.github.com/ronaldvdmeer/5dab71e495370ee96aa22798b0c79a9e


@mmb78 commented on GitHub (Mar 19, 2025):

Not sure if this is related, but I'm analyzing a long list of images that I feed to Gemma3 models via a Python script using the OpenAI-compatible API, and I have observed several times that after about 100 prompts the analysis time doubles (tested with the 27b and 12b models, q8 and q4 quantizations).
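Not the actual script, but a minimal sketch of a timing loop that would surface this slowdown, assuming Ollama's OpenAI-compatible endpoint on the default port (image payloads omitted for brevity):

# Send N requests and print each one's wall-clock time; a creeping
# per-request time shows the degradation directly.
for i in $(seq 1 200); do
  t=$(curl -s -o /dev/null -w '%{time_total}' \
    http://localhost:11434/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"gemma3:12b","messages":[{"role":"user","content":"describe the attached image"}]}')
  echo "request $i: ${t}s"
done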


@ronaldvdmeer commented on GitHub (Mar 19, 2025):

Update: after changing some settings in the systemd unit, the model keeps running as expected.

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/data/OllamaModels"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_KEEP_ALIVE=-1"
#Environment="OLLAMA_NO_CPU_FALLBACK=1"
#Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_DEBUG=1"

[Install]
WantedBy=default.target

I commented out OLLAMA_NO_CPU_FALLBACK=1 and OLLAMA_FLASH_ATTENTION=1
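One practical note: unit-file edits like the above only take effect after a daemon reload and a service restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama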


@rick-github commented on GitHub (Mar 19, 2025):

OLLAMA_NO_CPU_FALLBACK is not an ollama configuration variable, so FA would seem to be the culprit.


@ronaldvdmeer commented on GitHub (Mar 19, 2025):

> OLLAMA_NO_CPU_FALLBACK is not an ollama configuration variable, so FA would seem to be the culprit.

Indeed. Is FA something that should work with this model?


@rick-github commented on GitHub (Mar 19, 2025):

Mar 18 17:00:56 su8ai01 ollama[267855]: time=2025-03-18T17:00:56.939+01:00 level=INFO source=server.go:185 msg="enabling flash attention"
Mar 18 17:00:56 su8ai01 ollama[267855]: time=2025-03-18T17:00:56.939+01:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""

Perhaps try setting OLLAMA_KV_CACHE_TYPE to one of q4_0, q8_0 or fp16. Having said that, gemma3 is having other problems with FA (https://github.com/ollama/ollama/issues/9683), so it might be resolved as part of that issue.
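As a sketch, on this systemd setup that would look something like the following (q8_0 chosen here only as the usual middle ground between fp16 and q4_0):

sudo systemctl edit ollama
# add in the editor:
#   [Service]
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
sudo systemctl restart ollama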


@ronaldvdmeer commented on GitHub (Mar 19, 2025):

I thought we had found the problem, but it's still very unstable after a few prompts. Sometimes the CPU goes to 100% and nothing happens. Sometimes the process memory spikes. Not sure. Are you guys able to replicate these stability issues?
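For anyone trying, a simple stress loop along these lines (assumed: default port and the same model tag) tends to trigger hang-after-N-prompts issues faster than interactive use; the iteration count at failure is the useful data point:

# Hammers /api/generate until a request stalls.
i=0
while true; do
  i=$((i+1))
  timeout 120 curl -s http://localhost:11434/api/generate \
    -d '{"model":"gemma3:27b-it-q4_K_M","prompt":"tell me a short fact","stream":false}' \
    -o /dev/null || { echo "stalled at request $i"; break; }
done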


@Ducky6944 commented on GitHub (Mar 20, 2025):

> I thought we had found the problem, but it's still very unstable after a few prompts. Sometimes the CPU goes to 100% and nothing happens. Sometimes the process memory spikes. Not sure. Are you guys able to replicate these stability issues?

I am having similar issues using the gemma3:27b tag. Recently when it happened to me, I had nvtop open and it showed that GPU memory had gone to 0 and that the process wasn't showing. When I asked it another question, memory and CPU ramped back up, but it never generated anything. I'm running in Docker; not sure if that's related or not.


@ronaldvdmeer commented on GitHub (Mar 20, 2025):

I've experimented with unloading the model after 30 seconds. This works for a couple of prompts and then this happens:

(screenshot attached in the original GitHub issue; not mirrored)

root@su8ai01:~# ps aux  | grep ollama
ollama    620817 31.7  0.4 6935304 238816 ?      Ssl  09:44   3:29 /usr/local/bin/ollama serve
ronald    620834  0.1  0.0 1927064 31484 pts/5   Sl+  09:44   0:01 ollama run gemma3:27b-it-q4_K_M
ollama    622325 25.7  8.0 69847124 3949532 ?    Sl   09:49   1:31 /usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 8192 --batch-size 512 --n-gpu-layers 63 --threads 4 --parallel 4 --port 35997
Mar 20 09:42:20 su8ai01 systemd[1]: Stopped ollama.service - Ollama Service.
Mar 20 09:42:20 su8ai01 systemd[1]: ollama.service: Consumed 9h 11min 29.121s CPU time.
Mar 20 09:44:12 su8ai01 systemd[1]: Started ollama.service - Ollama Service.
Mar 20 09:44:12 su8ai01 ollama[620817]: 2025/03/20 09:44:12 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAM>
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.543+01:00 level=INFO source=images.go:432 msg="total blobs: 30"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.543+01:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.544+01:00 level=INFO source=routes.go:1297 msg="Listening on [::]:11434 (version 0.6.2)"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.544+01:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.854+01:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8a4489d7-2dab-940d-5a74-45dee3662627 library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" availabl>
Mar 20 09:44:36 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:36 | 200 |       48.14µs |       127.0.0.1 | HEAD     "/"
Mar 20 09:44:36 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:36 | 200 |   46.067577ms |       127.0.0.1 | POST     "/api/show"
Mar 20 09:44:36 su8ai01 ollama[620817]: time=2025-03-20T09:44:36.882+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.139+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.6 GiB" free_swap="0 B"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.140+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.278+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.280+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.283+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.308+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.309+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:39227"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.540+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:44:37 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:44:37 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:44:37 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.565+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.693+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.693+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.991+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:44:41 su8ai01 ollama[620817]: time=2025-03-20T09:44:41.530+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.901+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.902+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.902+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.904+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.907+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:44:43 su8ai01 ollama[620817]: time=2025-03-20T09:44:43.036+01:00 level=INFO source=server.go:619 msg="llama runner started in 5.75 seconds"
Mar 20 09:44:43 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:43 | 200 |   6.57078618s |       127.0.0.1 | POST     "/api/generate"
Mar 20 09:45:02 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:45:02 | 200 |  6.351604303s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.439+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.664+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.5 GiB" free_swap="0 B"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.666+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.770+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.772+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.775+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.780+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.781+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.789+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.790+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:42871"
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.893+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.893+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.893+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:45:40 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:45:40 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:45:40 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:45:40 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:45:40 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:45:40 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:45:40 su8ai01 ollama[620817]: time=2025-03-20T09:45:40.932+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.032+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.042+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.043+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:45:41 su8ai01 ollama[620817]: time=2025-03-20T09:45:41.486+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:45:44 su8ai01 ollama[620817]: time=2025-03-20T09:45:44.061+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.227+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.227+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.228+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.230+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.233+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.237+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:45:45 su8ai01 ollama[620817]: time=2025-03-20T09:45:45.316+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.54 seconds"
Mar 20 09:45:59 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:45:59 | 200 |  19.42089416s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:46:34 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:46:34 | 200 | 15.014740731s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:47:07 su8ai01 ollama[620817]: time=2025-03-20T09:47:07.790+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:47:07 su8ai01 ollama[620817]: time=2025-03-20T09:47:07.999+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.5 GiB" free_swap="0 B"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.000+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.107+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.109+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.112+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.118+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.119+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.119+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.119+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.140+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.141+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:41231"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.254+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.254+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.254+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:47:08 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:47:08 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:47:08 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:47:08 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:47:08 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:47:08 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.311+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.370+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.415+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.415+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:47:08 su8ai01 ollama[620817]: time=2025-03-20T09:47:08.821+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:47:11 su8ai01 ollama[620817]: time=2025-03-20T09:47:11.392+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.717+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.717+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.718+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.720+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.723+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.727+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:47:12 su8ai01 ollama[620817]: time=2025-03-20T09:47:12.901+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.78 seconds"
Mar 20 09:47:29 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:47:29 | 200 | 21.744458297s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:48:09 su8ai01 ollama[620817]: time=2025-03-20T09:48:09.770+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:48:09 su8ai01 ollama[620817]: time=2025-03-20T09:48:09.996+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.5 GiB" free_swap="0 B"
Mar 20 09:48:09 su8ai01 ollama[620817]: time=2025-03-20T09:48:09.998+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.101+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.104+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.106+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.113+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.114+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.133+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.133+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:39309"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.235+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.235+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.235+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:48:10 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:48:10 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:48:10 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:48:10 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:48:10 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:48:10 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.292+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.365+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.416+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.416+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:48:10 su8ai01 ollama[620817]: time=2025-03-20T09:48:10.821+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:48:13 su8ai01 ollama[620817]: time=2025-03-20T09:48:13.338+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.594+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.594+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.594+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.596+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.599+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.603+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.604+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:48:14 su8ai01 ollama[620817]: time=2025-03-20T09:48:14.844+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.73 seconds"
Mar 20 09:48:34 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:48:34 | 200 | 25.529366119s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.342+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.551+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.4 GiB" free_swap="0 B"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.553+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.662+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.664+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.666+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.680+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.683+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.684+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:35997"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:49:15 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:49:15 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:49:15 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.846+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.933+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.947+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.947+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:49:16 su8ai01 ollama[620817]: time=2025-03-20T09:49:16.386+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:49:18 su8ai01 ollama[620817]: time=2025-03-20T09:49:18.936+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.059+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.061+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.191+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.52 seconds"
Mar 20 09:49:37 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:49:37 | 200 | 22.737085656s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:50:24 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:50:24 | 200 | 24.209365834s |       127.0.0.1 | POST     "/api/chat"
Mar 20 09:51:20 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:51:20 | 200 | 30.315333068s |       127.0.0.1 | POST     "/api/chat"
<!-- gh-comment-id:2739647090 -->
@ronaldvdmeer commented on GitHub (Mar 20, 2025):

I've experimented with unloading the model after 30 seconds. This works for a couple of prompts and then this happens:

![Image](https://github.com/user-attachments/assets/a2c9de9e-2982-4122-b555-9138b380e4f6)

```
root@su8ai01:~# ps aux | grep ollama
ollama    620817 31.7  0.4  6935304  238816 ?      Ssl  09:44   3:29 /usr/local/bin/ollama serve
ronald    620834  0.1  0.0  1927064   31484 pts/5  Sl+  09:44   0:01 ollama run gemma3:27b-it-q4_K_M
ollama    622325 25.7  8.0 69847124 3949532 ?      Sl   09:49   1:31 /usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 8192 --batch-size 512 --n-gpu-layers 63 --threads 4 --parallel 4 --port 35997
```

```
Mar 20 09:42:20 su8ai01 systemd[1]: Stopped ollama.service - Ollama Service.
Mar 20 09:42:20 su8ai01 systemd[1]: ollama.service: Consumed 9h 11min 29.121s CPU time.
Mar 20 09:44:12 su8ai01 systemd[1]: Started ollama.service - Ollama Service.
Mar 20 09:44:12 su8ai01 ollama[620817]: 2025/03/20 09:44:12 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAM>
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.543+01:00 level=INFO source=images.go:432 msg="total blobs: 30"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.543+01:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.544+01:00 level=INFO source=routes.go:1297 msg="Listening on [::]:11434 (version 0.6.2)"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.544+01:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Mar 20 09:44:12 su8ai01 ollama[620817]: time=2025-03-20T09:44:12.854+01:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-8a4489d7-2dab-940d-5a74-45dee3662627 library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" availabl>
Mar 20 09:44:36 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:36 | 200 |       48.14µs |       127.0.0.1 | HEAD     "/"
Mar 20 09:44:36 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:44:36 | 200 |   46.067577ms |       127.0.0.1 | POST     "/api/show"
Mar 20 09:44:36 su8ai01 ollama[620817]: time=2025-03-20T09:44:36.882+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.139+01:00 level=INFO source=server.go:105 msg="system memory" total="47.0 GiB" free="44.6 GiB" free_swap="0 B"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.140+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full=>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.278+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.280+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.283+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.288+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.289+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.308+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.309+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:39227"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.454+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.540+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 20 09:44:37 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices:
Mar 20 09:44:37 su8ai01 ollama[620817]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Mar 20 09:44:37 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 20 09:44:37 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.565+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9>
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.693+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.693+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
Mar 20 09:44:37 su8ai01 ollama[620817]: time=2025-03-20T09:44:37.991+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
Mar 20 09:44:41 su8ai01 ollama[620817]: time=2025-03-20T09:44:41.530+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.901+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.902+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.902+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.904+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.907+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n>
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 20 09:44:42 su8ai01 ollama[620817]: time=2025-03-20T09:44:42.911+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
[... remainder of the paste repeats the journal excerpt shown above, through the 09:49:15 reload ...]
```
time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.672+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/OllamaModels/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f6> Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.673+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.680+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.683+01:00 level=INFO source=runner.go:763 msg="starting ollama engine" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.684+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:35997" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.791+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36 Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Mar 20 09:49:15 su8ai01 ollama[620817]: ggml_cuda_init: found 1 CUDA devices: Mar 20 09:49:15 su8ai01 ollama[620817]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes Mar 20 09:49:15 su8ai01 ollama[620817]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so Mar 20 09:49:15 su8ai01 ollama[620817]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.846+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,9> Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.933+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.947+01:00 level=INFO 
source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB" Mar 20 09:49:15 su8ai01 ollama[620817]: time=2025-03-20T09:49:15.947+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB" Mar 20 09:49:16 su8ai01 ollama[620817]: time=2025-03-20T09:49:16.386+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding" Mar 20 09:49:18 su8ai01 ollama[620817]: time=2025-03-20T09:49:18.936+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.056+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.059+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.061+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n> Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.066+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 Mar 20 09:49:20 su8ai01 ollama[620817]: time=2025-03-20T09:49:20.191+01:00 level=INFO source=server.go:619 msg="llama runner started in 4.52 seconds" Mar 20 09:49:37 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:49:37 | 200 | 22.737085656s | 127.0.0.1 | POST "/api/chat" Mar 20 09:50:24 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:50:24 | 200 | 24.209365834s | 127.0.0.1 | POST "/api/chat" Mar 20 09:51:20 su8ai01 ollama[620817]: [GIN] 2025/03/20 - 09:51:20 | 200 | 30.315333068s | 127.0.0.1 | POST "/api/chat" ```

@ronaldvdmeer commented on GitHub (Mar 26, 2025):

Am I the only one?


@rick-github commented on GitHub (Mar 26, 2025):

It seems so.

It might help if we can pinpoint where in the call chain the stalling happens. The server and the runner communicate via the port number specified by --port. If you start a listener on that port:

tcpflow -c -i lo port $PORT

and use the client to send requests, does traffic on the port stop when the model stops responding?
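
For reference, the runner's port is printed in the server log at startup (for example, "Server listening on 127.0.0.1:41231" in the log above; note it changes each time a runner is spawned), so a minimal capture looks like:

```console
# Substitute the port from your own "Server listening" log line;
# capturing on the loopback interface typically requires root.
$ PORT=41231
$ sudo tcpflow -c -i lo port $PORT
```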


@Ducky6944 commented on GitHub (Mar 26, 2025):

What's the output of `nvidia-smi -q`? In my case, my issue seems to have been caused by using an "unlicensed" vGPU. I haven't run into this since resolving the licensing issue, despite the fact that it was reporting as unrestricted, which I took at face value.


@ronaldvdmeer commented on GitHub (Mar 26, 2025):

> It seems so.
>
> It might help if we can pinpoint where in the call chain the stalling happens. The server and the runner communicate via the port number specified by --port. If you start a listener on that port:
>
> tcpflow -c -i lo port $PORT
>
> and use the client to send requests, does traffic on the port stop when the model stops responding?

https://pastebin.com/raw/4FyGnHuQ
With the last 3 or 4 prompts there was no response from the model anymore. The traffic pattern changed and became very limited. Not sure what that means.

> nvidia-smi -q

Here is the output: https://pastebin.com/raw/hNPC2dvT


@ronaldvdmeer commented on GitHub (Mar 26, 2025):

After extensive testing, I found that the GPU instability I was experiencing when running Ollama inside a Proxmox VM with passthrough was likely caused by a virtual IOMMU (vIOMMU) being presented to the VM.

Even though I hadn’t explicitly enabled IOMMU support inside the guest OS, the Proxmox configuration was set to expose a virtual IOMMU. The Linux kernel inside the VM automatically detected and enabled it, which I believe led to unexpected behavior with CUDA workloads.

Once I disabled the virtual IOMMU in the VM configuration, all instability disappeared. The system has been completely stable since, and performance is consistent.
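
For anyone wanting to check their own setup, a hypothetical sketch (assuming Proxmox 8.1+, where the vIOMMU is a viommu suboption of the VM's machine setting; exact syntax may vary by version):

```console
# Hypothetical: inspect the VM config for a viommu suboption on the machine line,
# e.g. "machine: q35,viommu=intel", then re-set the machine type without it.
$ qm config <vmid> | grep machine
$ qm set <vmid> --machine q35
```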

Does this seem logical?


@rick-github commented on GitHub (Mar 26, 2025):

Sounds logical. There are other reports of IOMMU settings causing issues, but usually by generating random tokens rather than getting wedged. Let it bake for a bit, and if you feel the problem is resolved, close the ticket; feel free to re-open if the problem recurs.


@Johnno1011 commented on GitHub (Mar 27, 2025):

Hey all.
I think this issue is the closest to what I'm experiencing with gemma3:27b.

When running the model entirely on CPU, I'm unable to get any response at all. If I wait long enough in the UI, I do get tokens, but it's EXTREMELY slow. My machine also goes flat out for seemingly little output. See the attached screenshot. I have tried turning off quantisation, flash attention, etc., but this has no effect.
Cheers.

Image: https://github.com/user-attachments/assets/7e1e82a8-c1e0-4fd0-a1ba-38ca70e5dbe6

@rick-github commented on GitHub (Mar 27, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.
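
On a systemd-based Linux install, for example, the tail of the server log can be pulled with:

```console
$ journalctl -u ollama --no-pager | tail -n 200
```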


@Johnno1011 commented on GitHub (Mar 27, 2025):

Server startup logs:

time=2025-03-27T15:04:48.367Z level=INFO source=server.go:105 msg="system memory" total="125.8 GiB" free="95.4 GiB" free_swap="0 B"
time=2025-03-27T15:04:48.369Z level=INFO source=server.go:138 msg=offload library=cpu layers.requested=-1 layers.model=63 layers.offload=0 layers.split="" memory.available="[95.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.9 GiB" memory.required.partial="0 B" memory.required.kv="992.0 MiB" memory.required.allocations="[18.9 GiB]" memory.weights.total="14.3 GiB" memory.weights.repeating="14.3 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="522.5 MiB" memory.graph.partial="1.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-03-27T15:04:48.369Z level=WARN source=server.go:196 msg="quantized kv cache requested but flash attention disabled" type=f32
time=2025-03-27T15:04:48.467Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-27T15:04:48.471Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-27T15:04:48.474Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-27T15:04:48.482Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-27T15:04:48.482Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /mnt/dsdrive/blobs/sha256-e796792eba26c4d3b04b0ac5adb01a453dd9ec2dfd83b6c59cbf6fe5f30b0f68 --ctx-size 2048 --batch-size 512 --threads 64 --no-mmap --parallel 1 --port 40519"
time=2025-03-27T15:04:48.482Z level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-27T15:04:48.482Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-03-27T15:04:48.483Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-03-27T15:04:48.496Z level=INFO source=runner.go:763 msg="starting ollama engine"
time=2025-03-27T15:04:48.496Z level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:40519"
time=2025-03-27T15:04:48.595Z level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-03-27T15:04:48.595Z level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-03-27T15:04:48.595Z level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=37
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-03-27T15:04:48.600Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-03-27T15:04:48.605Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="17.3 GiB"
time=2025-03-27T15:04:48.750Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
time=2025-03-27T15:04:54.129Z level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CPU
time=2025-03-27T15:04:54.129Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-27T15:04:54.133Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-27T15:04:54.137Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-27T15:04:54.144Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-27T15:04:54.257Z level=INFO source=server.go:619 msg="llama runner started in 5.77 seconds"

@Johnno1011 commented on GitHub (Mar 27, 2025):

Sorry, just to add to this: the quick rundown is that I'm running on CPU only here and have found the model far slower than I would expect. Putting the model on the GPU did sort the issue out, as of course it's quicker. Anyway, feel free to discard my points, as the CPU speed issue is unrelated to this specific ticket :)

<!-- gh-comment-id:2758503120 --> @Johnno1011 commented on GitHub (Mar 27, 2025): Sorry just to add to this, quick rundown is I'm running on CPU only here and have found the model is way way slower than I would expect. Putting the model on the GPU did sort the issue out as of course it's quicker. Anyway, feel free to discard my points as the CPU speed issue is something unrelated to this specific ticket :)
Author
Owner

@rick-github commented on GitHub (Mar 27, 2025):

When running on CPU, the constraint for large models is memory bandwidth. gemma3:27b wants about 19G of RAM. My test system has 5600MHz DDR5 RAM, with a memory bandwidth of about 69GB/s. That means the model can generate at most 69/19 ≈ 3.6 tps. The output of `ollama run --verbose` shows 3.42 tps, so close to the max. If you are seeing slow inference on the CPU, your memory bandwidth is probably lower. I see that your CPU is Haswell architecture; doing a bit of a search, it looks like DDR3 and DDR4 memory subsystems were common for Haswell, with a max memory bandwidth of 25GB/s with 3200MHz DDR4.

Everything else in the log looks normal, so I think this is just a case of hardware limitations.
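
As a quick sanity check of that arithmetic (a sketch using the numbers quoted above; substitute your own bandwidth and model size):

```console
$ echo "scale=2; 69/19" | bc   # ~69 GB/s DDR5 system: a ceiling of about 3.63 tps
3.63
$ echo "scale=2; 25/19" | bc   # ~25 GB/s DDR4 Haswell system: a ceiling of about 1.31 tps
1.31
```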


@Johnno1011 commented on GitHub (Mar 27, 2025):

> When running on CPU, the constraint for large models is memory bandwidth. gemma3:27b wants about 19G of RAM. My test system has 5600MHz DDR5 RAM, with a memory bandwidth of about 69GB/s. That means the model can generate at most 69/19 ≈ 3.6 tps. The output of `ollama run --verbose` shows 3.42 tps, so close to the max. If you are seeing slow inference on the CPU, your memory bandwidth is probably lower. I see that your CPU is Haswell architecture; doing a bit of a search, it looks like DDR3 and DDR4 memory subsystems were common for Haswell, with a max memory bandwidth of 25GB/s with 3200MHz DDR4.
>
> Everything else in the log looks normal, so I think this is just a case of hardware limitations.

This is really useful information, thank you!
Based on my system, I have approx 24GB/s of DDR4 bandwidth, so 24/19 ≈ 1.26 tps, but I am not getting this. I am running on a 64-core CPU machine. Any thoughts? Thanks


@rick-github commented on GitHub (Mar 27, 2025):

> Any thoughts?

Possibly thread contention or cache thrashing. What happens if you reduce the number of threads? Try 1 and 31; if there's a change in performance, do a binary search to find the best value (a scripted version of the sweep is sketched after the transcript below):

$ ollama run gemma3:27b --verbose
>>> hello
Hello there! 👋 

How can I help you today? Just let me know what you're thinking, or if you just wanted to say hi, that's great too! 

I can:

* **Answer questions:** About pretty much anything!
* **Generate creative content:** Like stories, poems, code, scripts, musical pieces, email, letters, etc.
* **Translate languages.**
* **Summarize text.**
* **Help with brainstorming.**
* **Just chat!**


total duration:       33.118166262s
load duration:        54.58724ms
prompt eval count:    10 token(s)
prompt eval duration: 1.410197875s
prompt eval rate:     7.09 tokens/s
eval count:           108 token(s)
eval duration:        31.652382454s
eval rate:            3.41 tokens/s
>>> /set parameter num_thread 1
Set parameter 'num_thread' to '1'
>>> say hello, concisely
Hello! 👋


total duration:       1m46.132958402s
load duration:        3.663317777s
prompt eval count:    131 token(s)
prompt eval duration: 1m38.085952912s
prompt eval rate:     1.34 tokens/s
eval count:           5 token(s)
eval duration:        4.365867083s
eval rate:            1.15 tokens/s
>>> /set parameter num_thread 31
Set parameter 'num_thread' to '31'
>>> say hello, concisely
Hi! 😊


total duration:       2m9.401733529s
load duration:        3.883104395s
prompt eval count:    149 token(s)
prompt eval duration: 1m2.323025105s
prompt eval rate:     2.39 tokens/s
eval count:           5 token(s)
eval duration:        1m3.166008742s
eval rate:            0.08 tokens/s
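
A scripted version of that sweep, as a minimal sketch: it assumes a local server on the default port 11434 with jq installed, and reads the eval_count and eval_duration fields (nanoseconds) that /api/generate reports when stream is false. num_thread is applied per request, so the model should stay loaded between calls:

```console
$ for t in 1 2 4 8 16 24 32 48 64; do
>   curl -s http://localhost:11434/api/generate \
>     -d '{"model":"gemma3:27b","prompt":"say hello, concisely","stream":false,"options":{"num_thread":'"$t"'}}' \
>     | jq --argjson t "$t" '{num_thread: $t, eval_rate_tps: (.eval_count / .eval_duration * 1e9)}'
> done
```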

@ronaldvdmeer commented on GitHub (Mar 27, 2025):

> Sounds logical. There are other reports of IOMMU settings causing issues, but usually by generating random tokens rather than getting wedged. Let it bake for a bit, and if you feel the problem is resolved, close the ticket; feel free to re-open if the problem recurs.

Just wanted to confirm that after another full day of heavy use, the system remains completely stable. I haven’t had to reboot even once — which is a first since I started using GPU passthrough with Ollama. Disabling the virtual IOMMU in the VM configuration clearly solved it for me.

I realize the thread now includes a separate issue related to CPU behavior, which may be unrelated to this specific passthrough problem. Hopefully, this update still helps others who run into similar stability problems with NVIDIA GPUs and virtual IOMMUs in Proxmox environments.


@ronaldvdmeer commented on GitHub (Mar 27, 2025):

Thanks everyone — I’m going to close this issue since the original GPU passthrough instability is now fully resolved.
For those experiencing unrelated issues (like CPU behavior), it’s probably best to open a separate ticket to keep things focused.


@rick-github commented on GitHub (Mar 27, 2025):

Thanks @ronaldvdmeer, these corner cases are always a bugbear and I'm glad it's been resolved.


Reference: github-starred/ollama#52966