[GH-ISSUE #10553] mistral-small3.1:24b-instruct-2503 architecture mistral3? #53457

Closed
opened 2026-04-29 03:15:55 -05:00 by GiteaMirror · 15 comments

Originally created by @MarkWard0110 on GitHub (May 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10553

What is the issue?

I need help understanding the differences I am seeing when running mistral-small3.1:24b-instruct-2503.

Ollama's model repo mistral-small3.1:24b-instruct-2503 has model architecture mistral3.
The previous mistral-small:24b-instruct-2501 model's architecture was llama.
The Hugging Face model hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M has model architecture llama.

mistral-small3.1:24b-instruct-2503 appears to be slower and requires more RAM at the same context sizes, whereas hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M behaves much like mistral-small:24b-instruct-2501.

I recorded the following metrics, listed as model, response time, and tokens per second.

The new model using the mistral3 architecture is significantly slower than the previous release using llama; the Hugging Face model using llama has tokens-per-second numbers similar to the previous release.

mistral-small:24b-instruct-2501-q4_K_M	00:00:14.1491177	56.69
mistral-small:24b-instruct-2501-q4_K_M	00:00:13.4088150	56.25
mistral-small:24b-instruct-2501-q4_K_M	00:00:09.0160801	56.25
mistral-small:24b-instruct-2501-q4_K_M	00:00:01.1078162	56.54
mistral-small:24b-instruct-2501-q4_K_M	00:00:06.5794542	56.4
mistral-small:24b-instruct-2501-q4_K_M	00:00:02.7945777	56.69
mistral-small:24b-instruct-2501-q4_K_M	00:00:00.1930713	49.39

mistral-small3.1:24b-instruct-2503-q4_K_M	00:00:45.9877791	12.25
mistral-small3.1:24b-instruct-2503-q4_K_M	00:00:56.6890009	16.59
mistral-small3.1:24b-instruct-2503-q4_K_M	00:00:29.2965343	16.15
mistral-small3.1:24b-instruct-2503-q4_K_M	00:00:07.4886711	12.94
mistral-small3.1:24b-instruct-2503-q4_K_M	00:00:29.0066623	12.99
mistral-small3.1:24b-instruct-2503-q4_K_M	00:00:09.2215099	14.72
mistral-small3.1:24b-instruct-2503-q4_K_M	00:00:00.5051990	21.35

hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	00:00:15.5489702	56.74
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	00:00:16.2742764	56.33
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	00:00:09.8189027	56.3
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	00:00:01.2301923	56.14
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	00:00:05.1030198	56.51
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	00:00:03.1872401	56.34
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	00:00:00.9599150	54.62

mistral-small:24b-instruct-2501-q8_0	00:00:29.3273764	29.94
mistral-small:24b-instruct-2501-q8_0	00:00:31.6365711	29.84
mistral-small:24b-instruct-2501-q8_0	00:00:15.6844838	29.92
mistral-small:24b-instruct-2501-q8_0	00:00:01.5447448	30.1
mistral-small:24b-instruct-2501-q8_0	00:00:08.9758766	29.93
mistral-small:24b-instruct-2501-q8_0	00:00:05.2431780	29.99
mistral-small:24b-instruct-2501-q8_0	00:00:00.2929514	31.5

mistral-small3.1:24b-instruct-2503-q8_0	00:00:45.3856788	15.64
mistral-small3.1:24b-instruct-2503-q8_0	00:01:26.4857499	13.34
mistral-small3.1:24b-instruct-2503-q8_0	00:00:34.0899852	14.12
mistral-small3.1:24b-instruct-2503-q8_0	00:00:04.2336628	15.58
mistral-small3.1:24b-instruct-2503-q8_0	00:00:22.6926162	14.09
mistral-small3.1:24b-instruct-2503-q8_0	00:00:10.7818053	13.58
mistral-small3.1:24b-instruct-2503-q8_0	00:00:00.7040794	14.33
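
For reference, tokens-per-second numbers like these can be derived from the eval_count and eval_duration fields returned by Ollama's /api/generate endpoint (eval_duration is in nanoseconds). A minimal sketch; the host, model name, and prompt are placeholders:

```python
import json
import urllib.request

def eval_rate(eval_count, eval_duration_ns):
    """Tokens per second from Ollama's eval_count / eval_duration (ns) fields."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model, prompt, host="http://localhost:11434"):
    # Non-streaming generate request; the response JSON carries the timing fields.
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    resp = json.load(urllib.request.urlopen(req))
    return eval_rate(resp["eval_count"], resp["eval_duration"])
```

Usage would be e.g. `benchmark("mistral-small3.1:24b-instruct-2503-q4_K_M", "hello")`, matching what `ollama run --verbose` prints as "eval rate".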

The memory used is also different. The following lists model, context size, and memory used. There is also something very odd about the memory allocations of the Ollama 3.1 mistral3 version: its reported usage actually drops between a context size of 65536 and 131072.

mistral-small:24b-instruct-2501-q4_K_M	2048	14.3GiB
mistral-small:24b-instruct-2501-q4_K_M	4096	14.7GiB
mistral-small:24b-instruct-2501-q4_K_M	8192	15.6GiB
mistral-small:24b-instruct-2501-q4_K_M	16384	17.4GiB
mistral-small:24b-instruct-2501-q4_K_M	32768	21.0GiB

mistral-small3.1:24b-instruct-2503-q4_K_M	2048	24.9GiB
mistral-small3.1:24b-instruct-2503-q4_K_M	4096	25.6GiB
mistral-small3.1:24b-instruct-2503-q4_K_M	8192	27.1GiB
mistral-small3.1:24b-instruct-2503-q4_K_M	16384	30.1GiB
mistral-small3.1:24b-instruct-2503-q4_K_M	32768	36.1GiB
mistral-small3.1:24b-instruct-2503-q4_K_M	65536	41.0GiB
mistral-small3.1:24b-instruct-2503-q4_K_M	131072	33.9GiB

hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	2048	14.3GiB
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	4096	14.7GiB
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	8192	15.6GiB
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	16384	17.4GiB
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	32768	21.0GiB
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	65536	34.0GiB
hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M	131072	53.8GiB


mistral-small:24b-instruct-2501-q8_0	2048	26.5GiB
mistral-small:24b-instruct-2501-q8_0	4096	26.9GiB
mistral-small:24b-instruct-2501-q8_0	8192	27.5GiB
mistral-small:24b-instruct-2501-q8_0	16384	29.6GiB
mistral-small:24b-instruct-2501-q8_0	32768	34.4GiB

mistral-small3.1:24b-instruct-2503-q8_0	2048	35.1GiB
mistral-small3.1:24b-instruct-2503-q8_0	4096	35.8GiB
mistral-small3.1:24b-instruct-2503-q8_0	8192	37.3GiB
mistral-small3.1:24b-instruct-2503-q8_0	16384	40.3GiB
mistral-small3.1:24b-instruct-2503-q8_0	32768	46.3GiB
mistral-small3.1:24b-instruct-2503-q8_0	65536	50.3GiB
mistral-small3.1:24b-instruct-2503-q8_0	131072	42.8GiB
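
The roughly linear growth with context size is what the KV cache predicts. A back-of-the-envelope estimator; the layer/head/width numbers below are illustrative assumptions, not the actual Mistral Small config:

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_el=2):
    """Approximate KV-cache size in GiB: one K and one V tensor per layer, fp16 by default."""
    kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_el
    return kv_bytes / 2**30

# Illustrative numbers only: 40 layers, 8 KV heads, head_dim 128, fp16 cache.
for n_ctx in (2048, 4096, 8192, 16384, 32768):
    print(n_ctx, round(kv_cache_gib(n_ctx, 40, 8, 128), 2), "GiB")
```

This accounts only for the KV cache on top of the fixed weight size; it cannot explain the mistral3 variants' drop in reported usage at 131072.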


ollama show hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M
  Model
    architecture        llama
    parameters          23.6B
    context length      131072
    embedding length    5120
    quantization        unknown

  Capabilities
    completion
    tools

  Projector
    architecture        clip
    parameters          438.96M
    embedding length    1024
    dimensions          5120

  Parameters
    stop    "[INST]"

ollama show mistral-small3.1:24b-instruct-2503-q4_K_M
  Model
    architecture        mistral3
    parameters          24.0B
    context length      131072
    embedding length    5120
    quantization        Q4_K_M

  Capabilities
    completion
    vision
    tools

  Parameters
    num_ctx    4096

  System
    You are Mistral Small 3.1, a Large Language Model (LLM) created by Mistral AI, a French startup
      headquartered in Paris.
    You power an AI assistant called Le Chat.



ollama show mistral-small:24b-instruct-2501-q4_K_M
  Model
    architecture        llama
    parameters          23.6B
    context length      32768
    embedding length    5120
    quantization        Q4_K_M

  Capabilities
    completion
    tools

  Parameters
    temperature    0.15

  System
    You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup
      headquartered in Paris. Your knowledge base was last updated on 2023-10-01. When you're not sure
      about some information, you say that you don't have the information and don't make up anything.
      If the user's question is not clear, ambiguous, or does not provide enough context for you to
      accurately answer the question, you do not try to answer it right away and you rather ask the user
      to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or
      "When is the next flight to Tokyo" => "Where do you travel from?")

  License
    Apache License
    Version 2.0, January 2004



ollama show mistral-small3.1:24b-instruct-2503-q8_0
  Model
    architecture        mistral3
    parameters          24.0B
    context length      131072
    embedding length    5120
    quantization        Q8_0

  Capabilities
    completion
    vision
    tools

  Parameters
    num_ctx    4096

  System
    You are Mistral Small 3.1, a Large Language Model (LLM) created by Mistral AI, a French startup
      headquartered in Paris.
    You power an AI assistant called Le Chat.

ollama show mistral-small:24b-instruct-2501-q8_0
  Model
    architecture        llama
    parameters          23.6B
    context length      32768
    embedding length    5120
    quantization        Q8_0

  Capabilities
    completion
    tools

  Parameters
    temperature    0.15

  System
    You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup
      headquartered in Paris. Your knowledge base was last updated on 2023-10-01. When you're not sure
      about some information, you say that you don't have the information and don't make up anything.
      If the user's question is not clear, ambiguous, or does not provide enough context for you to
      accurately answer the question, you do not try to answer it right away and you rather ask the user
      to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or
      "When is the next flight to Tokyo" => "Where do you travel from?")

  License
    Apache License
    Version 2.0, January 2004

Relevant log output


OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.6.7

GiteaMirror added the bug label 2026-04-29 03:15:55 -05:00

@rick-github commented on GitHub (May 4, 2025):

You don't indicate the hardware that you are running on. If I test this on 2x A100 40GB, I get comparable results:

$ for i in mistral-small:24b-instruct-2501-q4_K_M mistral-small3.1:24b-instruct-2503-q4_K_M  ; do ollama run --verbose $i hello ; done
Hello! How can I assist you today?

total duration:       6.268738597s
load duration:        5.73568526s
prompt eval count:    162 token(s)
prompt eval duration: 337.648618ms
prompt eval rate:     479.79 tokens/s
eval count:           10 token(s)
eval duration:        194.323291ms
eval rate:            51.46 tokens/s

Hello! How can I assist you today?

total duration:       6.516360924s
load duration:        5.840908652s
prompt eval count:    359 token(s)
prompt eval duration: 500.315958ms
prompt eval rate:     717.55 tokens/s
eval count:           10 token(s)
eval duration:        174.212716ms
eval rate:            57.40 tokens/s
$ ollama ps
NAME                                         ID              SIZE     PROCESSOR    UNTIL            
mistral-small3.1:24b-instruct-2503-q4_K_M    b9aaf0c2586a    26 GB    100% GPU     2 hours from now    
mistral-small:24b-instruct-2501-q4_K_M       8039dd90c113    15 GB    100% GPU     2 hours from now 

Note that 3.1 can process images, and the weights for the vision model result in a much larger VRAM footprint. This can cause the model to spill layers to system RAM, resulting in slower inference.

I was unable to test the bartowski model since the version I downloaded has an unsupported vision projector. The repo shows that this was added a couple of days ago for llama.cpp support.
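
Whether a model has spilled to system RAM can also be checked programmatically; Ollama's /api/ps endpoint reports each loaded model's total size and size_vram. A sketch assuming those field names:

```python
import json
import urllib.request

def fully_offloaded(size, size_vram):
    """True if the whole model fits in VRAM (what `ollama ps` shows as 100% GPU)."""
    return size_vram >= size

def check_models(host="http://localhost:11434"):
    # List loaded models and how much of each is resident in VRAM.
    resp = json.load(urllib.request.urlopen(f"{host}/api/ps"))
    for m in resp.get("models", []):
        pct = 100 * m["size_vram"] / m["size"] if m["size"] else 0
        status = "100% GPU" if fully_offloaded(m["size"], m["size_vram"]) else f"{pct:.0f}% GPU"
        print(m["name"], status)
```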


@MarkWard0110 commented on GitHub (May 5, 2025):

@rick-github
My hardware:
Intel Core i9-14900K
96GB RAM (6400 MT/s)
RTX 3090 24GB
RTX 4070 Ti Super 16GB

So you are saying the difference might be due to the newer 3.1 including the vision model? And now that the 3.1 architecture is mistral3, could the additional memory and lower tokens per second be because the vision model is loaded as well, while the bartowski model with the llama architecture loads only the LLM?

I'll see if I can compare with the updated bartowski model, but there might be details that I am not understanding. I ran the pull again and it seems there is no change; it still has the llama architecture.

ollama pull hf.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_M
pulling manifest
pulling c5743c1bf39d: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████▏  14 GB
pulling 6db27cd4e277: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████▏  695 B
pulling f5add93ad360: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 878 MB
pulling 4d1dedbfd2bd: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████▏   19 B
pulling 57122ba533ca: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████▏  626 B
verifying sha256 digest
writing manifest
success

@thot-experiment commented on GitHub (May 7, 2025):

> Note that 3.1 can process images and the weights for the vision model results in a much larger VRAM footprint

@rick-github this is actually incorrect: the vision tower/projector is not 11 GB, and ollama's memory estimation/measurement is still broken.

> ollama --version
ollama version is 0.6.8
> ollama ps
NAME            ID              SIZE     PROCESSOR    UNTIL
m31.32k:q5ks    03154eff3cf3    31 GB    100% GPU     Forever

vs nvidia-smi taken during vision inference

> nvidia-smi
Wed May  7 11:26:21 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 576.02                 Driver Version: 576.02         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   1  Quadro GV100                 WDDM  |   00000000:08:00.0 Off |                  Off |
| 46%   51C    P2            204W /  250W |   19963MiB /  32768MiB |     89%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

@rick-github commented on GitHub (May 7, 2025):

> so you are saying the difference might be due to the newer 3.1 having the vision model

For the most part.

> additional memory and lower tokens could be the vision model is active

By active meaning loaded into VRAM, yes. Not necessarily being used for inference.

> has the architecture as llama

The distinction between llama and mistral3 architecture comes from the different ways they are quantised. bartowski has made a quant from just the text-text weights, and labelled it llama for compatibility with those who want to run it with a base llama.cpp backend. ollama has fused the text-text and image-text weights into a single GGUF and labelled it mistral3 to distinguish from other text-text only files.
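
The declared architecture is just a metadata string in the GGUF header (the general.architecture key), so the difference between the two files can be inspected directly. A sketch, with the helper operating on a plain metadata dict; reading a real file would use something like the gguf PyPI package, whose exact reader API is not shown here:

```python
def architecture(metadata):
    """Return the architecture a GGUF file declares, e.g. 'llama' or 'mistral3'."""
    return metadata.get("general.architecture", "unknown")

# The fused Ollama upload and the text-only bartowski quant differ on this key:
assert architecture({"general.architecture": "mistral3"}) == "mistral3"
assert architecture({"general.architecture": "llama"}) == "llama"
```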

> > "Note that 3.1 can process images and the weights for the vision model results in a much larger VRAM footprint"
>
> @rick-github this is actually incorrect, the vision tower/projector are not 11 gigs

I didn't say that the projector was 11G, although I can see how it might be mis-read. The projector contributes 9.5G to the increased VRAM usage.

> m31.32k:q5ks 03154eff3cf3 31 GB 100% GPU Forever
> | 46% 51C P2 204W / 250W | 19963MiB / 32768MiB | 89% Default |

> ollama's memory estimation/measurement is still broken

Correct, although such a large disparity is not common. If you share logs, the cause might be determined.


@thot-experiment commented on GitHub (May 7, 2025):

> The projector contributes 9.5G to the increased VRAM usage.

Interesting; that still seems wrong, as my non-vision q5ks GGUF uses ~16 GB during inference IIRC, so it's only a ~4 GB delta. I made the compatible quant myself using ollama's internal quantization tool; does that also quant the vision projector?

> If you share logs, the cause might be determined.

Sure, what logs would you like? Definitely eager to help, as this is causing me no end of headaches
(currently I have to shut down everything that uses VRAM before loading Mistral and then load it back up after, or ollama splits the model to CPU).

P.S. Is there a way to either make ollama ignore its memory estimates and try loading anyway, or to error out if a model would be split to CPU? It would be preferable for me to get an error than to have the model opaquely split to CPU without any indication (I am currently checking this manually by calling /ps).

Author
Owner

@rick-github commented on GitHub (May 7, 2025):

> sure, what logs would you like?

https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues

> p.s. is there a way to either cause ollama to ignore it's memory estimates and try loading anyway

Set `num_gpu` to the layer count of the model, or 999 for q&d. Possible OOMs and [performance issues](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900). See https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650.
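For API callers, `num_gpu` can also be passed per request via `options` on `/api/generate`, rather than baking it into a Modelfile. A stdlib-only sketch; the helper name `generate_request` is invented here, and the model name is just the one from this thread:

```python
import json
from urllib import request

def generate_request(model, prompt, num_gpu=999,
                     base_url="http://localhost:11434"):
    """Build a /api/generate request forcing up to num_gpu layers onto GPU."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},  # 999 = "offload every layer" shortcut
    }).encode()
    return request.Request(f"{base_url}/api/generate", data=body,
                           headers={"Content-Type": "application/json"})

req = generate_request("mistral-small3.1:24b-instruct-2503-q4_K_M", "hello")
# request.urlopen(req) would send it; expect CUDA OOMs if VRAM falls short.
print(json.loads(req.data)["options"])  # prints {'num_gpu': 999}
```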


@MarkWard0110 commented on GitHub (May 17, 2025):

After upgrading to Ollama 0.7.0:

mistral-small3.1:24b-instruct-2503-q4_K_M is somehow split between CPU and GPU, when it was 100% GPU before 0.7.0.
I have 2 NVIDIA GPUs giving a total of 40 GB VRAM: one 3090 (24 GB) and one 4070 Ti SUPER (16 GB).
It appears that when the mistral model is loaded, only the first GPU (the 3090) is being counted. 24 of 35 is ~69%, and that math seems to explain what ollama ps is reporting.
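That arithmetic is easy to check against the `ollama ps` split (assuming, as above, that only the 3090's 24 GB is counted against the 35 GB total):

```python
# 35 GB reported model size; only the 3090's 24 GB apparently counted as GPU.
gpu_share = 24 / 35
print(f"{1 - gpu_share:.0%}/{gpu_share:.0%} CPU/GPU")  # prints "31%/69% CPU/GPU"
```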

```
C:\Users\wardm> ollama ps
NAME                                         ID              SIZE     PROCESSOR          UNTIL
mistral-small3.1:24b-instruct-2503-q4_K_M    b9aaf0c2586a    35 GB    31%/69% CPU/GPU    5 minutes from now
C:\Users\wardm> nvidia-smi
Fri May 16 20:57:09 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 576.28                 Driver Version: 576.28         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:01:00.0 Off |                  N/A |
|  0%   45C    P8             21W /  370W |    8942MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4070 ...  WDDM  |   00000000:08:00.0 Off |                  N/A |
|  0%   36C    P8              7W /  285W |     208MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           66348      C   ...al\Programs\Ollama\ollama.exe      N/A      |
|    1   N/A  N/A           66348      C   ...al\Programs\Ollama\ollama.exe      N/A      |
+-----------------------------------------------------------------------------------------+
C:\Users\wardm> ollama ps
NAME                                         ID              SIZE     PROCESSOR          UNTIL
mistral-small3.1:24b-instruct-2503-q4_K_M    b9aaf0c2586a    35 GB    31%/69% CPU/GPU    4 minutes from now
```

Maybe Ollama does not see both cards, you say? Qwen3 would like to show you that all 35 GB of it fits in the GPUs.

```
C:\Users\wardm> ollama ps
NAME                    ID              SIZE     PROCESSOR    UNTIL
qwen3:30b-a3b-q4_K_M    2ee832bc15b5    35 GB    100% GPU     Forever
C:\Users\wardm> nvidia-smi
Fri May 16 21:14:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 576.28                 Driver Version: 576.28         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:01:00.0 Off |                  N/A |
|  0%   49C    P3             73W /  370W |   13520MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4070 ...  WDDM  |   00000000:08:00.0 Off |                  N/A |
|  0%   43C    P3             25W /  285W |    9484MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           68288      C   ...al\Programs\Ollama\ollama.exe      N/A      |
|    1   N/A  N/A           68288      C   ...al\Programs\Ollama\ollama.exe      N/A      |
+-----------------------------------------------------------------------------------------+
```

@MarkWard0110 commented on GitHub (May 25, 2025):

Something else is weird. I upgraded to 0.7.1.

What is weird is that I just ran this model earlier today, and now:


time=2025-05-24T19:12:07.132-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc library=cuda parallel=1 required="25.6 GiB"
time=2025-05-24T19:12:07.164-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="66.8 GiB" free_swap="54.3 GiB"
time=2025-05-24T19:12:07.198-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[22.8 GiB 14.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="25.6 GiB" memory.required.partial="25.6 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[17.8 GiB 7.9 GiB]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="426.7 MiB" memory.graph.partial="426.7 MiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-05-24T19:12:07.198-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-05-24T19:12:07.198-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-05-24T19:12:07.246-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc --ctx-size 4096 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 57105"
time=2025-05-24T19:12:07.251-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-05-24T19:12:07.251-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-24T19:12:07.253-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-05-24T19:12:07.281-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-05-24T19:12:07.282-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:57105"
time=2025-05-24T19:12:07.300-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=43
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-24T19:12:07.446-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-24T19:12:07.504-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-24T19:12:07.516-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-05-24T19:12:07.516-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="6.7 GiB"
time=2025-05-24T19:12:07.516-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="7.2 GiB"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 9337.48 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 9791055360
time=2025-05-24T19:12:08.270-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-05-24T19:12:08.270-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-05-24T19:12:08.270-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
panic: insufficient memory - required allocations: {InputWeights:550502400A CPU:{Name:CPU Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} GPUs:[{Name:CUDA0 Weights:[363438080A 363438080A 363438080A 363438080A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} {Name:CUDA1 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 363438080A 363438080A 363438080A 363438080A 363438080A 1255526400A] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:9791055360F}]}

goroutine 37 [running]:
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0xc001174140)
	C:/a/ollama/ollama/ml/backend/ggml/ggml.go:643 +0x756
github.com/ollama/ollama/runner/ollamarunner.multimodalStore.getTensor(0xc0000499f8?, {0x7ff7443ae7f0, 0xc0002ca120}, {0x7ff7443b2b68, 0xc001174c80}, {0x7ff7443bec48, 0xc0010cd9f8}, 0x1)
	C:/a/ollama/ollama/runner/ollamarunner/multimodal.go:98 +0x2a4
github.com/ollama/ollama/runner/ollamarunner.multimodalStore.getMultimodal(0xc000049cd8, {0x7ff7443ae7f0, 0xc0002ca120}, {0x7ff7443b2b68, 0xc001174c80}, {0xc00115e0a0, 0x1, 0x2b3e0021718?}, 0x1)
	C:/a/ollama/ollama/runner/ollamarunner/multimodal.go:56 +0xe5
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc0004c26c0)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:796 +0x70e
github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc0004c26c0, {0xc0000dc000?, 0x0?}, {0x8, 0x0, 0x29, {0xc000161930, 0x2, 0x2}, 0x1}, ...)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc0004c26c0, {0x7ff7443aa7c0, 0xc000623450}, {0xc0000dc000?, 0x0?}, {0x8, 0x0, 0x29, {0xc000161930, 0x2, ...}, ...}, ...)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11
time=2025-05-24T19:12:08.375-05:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
time=2025-05-24T19:12:08.506-05:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory"
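Reading the failed load above, the shortfall on the second GPU seems visible in the log's own numbers: the scheduler budgeted 7.9 GiB for CUDA1 (`memory.required.allocations="[17.8 GiB 7.9 GiB]"`), but the runner then placed 7.2 GiB of weights there and tried to reserve a 9791055360-byte multimodal compute graph on top, against 14.7 GiB available. A quick check of that arithmetic:

```python
GIB = 1024 ** 3
cuda1_available = 14.7          # GiB, memory.available for CUDA1 in the log
cuda1_weights = 7.2             # GiB, "model weights" buffer=CUDA1
cuda1_graph = 9791055360 / GIB  # bytes of the failed CUDA1 graph buffer
print(f"graph: {cuda1_graph:.1f} GiB")                  # prints "graph: 9.1 GiB"
print(f"needed: {cuda1_weights + cuda1_graph:.1f} GiB "
      f"vs {cuda1_available} GiB available")
```

So the graph buffer alone exceeds the scheduler's entire CUDA1 budget, which would explain the cudaMalloc failure despite the "will fit in available VRAM" estimate.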

But I can load a 70B model into VRAM right after?


time=2025-05-24T19:12:13.554-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0486821 runner.size="25.6 GiB" runner.vram="25.6 GiB" runner.parallel=1 runner.pid=9508 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-05-24T19:12:13.798-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2921949 runner.size="25.6 GiB" runner.vram="25.6 GiB" runner.parallel=1 runner.pid=9508 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-05-24T19:12:14.048-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5424168 runner.size="25.6 GiB" runner.vram="25.6 GiB" runner.parallel=1 runner.pid=9508 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-05-24T19:14:00.424-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-dae4a55f7017b5454065aa9294d8f4acb05dc9a9f87696479c461ce28b08dcc3 library=cuda parallel=1 required="36.6 GiB"
time=2025-05-24T19:14:00.456-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="66.7 GiB" free_swap="54.2 GiB"
time=2025-05-24T19:14:00.487-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=81 layers.offload=81 layers.split=49,32 memory.available="[22.8 GiB 14.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="36.6 GiB" memory.required.partial="36.6 GiB" memory.required.kv="1.2 GiB" memory.required.allocations="[22.0 GiB 14.7 GiB]" memory.weights.total="31.5 GiB" memory.weights.repeating="30.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2025-05-24T19:14:00.487-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-05-24T19:14:00.487-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from o:\ollama\models\blobs\sha256-dae4a55f7017b5454065aa9294d8f4acb05dc9a9f87696479c461ce28b08dcc3 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.1 70B Instruct 2024 12
llama_model_loader: - kv   3:                            general.version str              = 2024-12
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Llama-3.1
llama_model_loader: - kv   6:                         general.size_label str              = 70B
llama_model_loader: - kv   7:                            general.license str              = llama3.1
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Llama 3.1 70B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Meta Llama
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv  12:                               general.tags arr[str,5]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv  13:                          general.languages arr[str,7]       = ["fr", "it", "pt", "hi", "es", "th", ...
llama_model_loader: - kv  14:                          llama.block_count u32              = 80
llama_model_loader: - kv  15:                       llama.context_length u32              = 131072
llama_model_loader: - kv  16:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv  17:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv  18:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv  19:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  20:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  21:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  22:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  23:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  24:                          general.file_type u32              = 12
llama_model_loader: - kv  25:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  162 tensors
llama_model_loader: - type q3_K:  321 tensors
llama_model_loader: - type q4_K:  155 tensors
llama_model_loader: - type q5_K:   85 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 31.91 GiB (3.88 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 70.55 B
print_info: general.name     = Llama 3.1 70B Instruct 2024 12
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-24T19:14:00.720-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model o:\\ollama\\models\\blobs\\sha256-dae4a55f7017b5454065aa9294d8f4acb05dc9a9f87696479c461ce28b08dcc3 --ctx-size 4096 --batch-size 512 --n-gpu-layers 81 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 49,32 --port 57915"
time=2025-05-24T19:14:00.724-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-05-24T19:14:00.724-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-24T19:14:00.725-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-05-24T19:14:00.755-05:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-24T19:14:00.899-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-24T19:14:00.900-05:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:57915"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4070 Ti SUPER) - 15089 MiB free
time=2025-05-24T19:14:00.976-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from o:\ollama\models\blobs\sha256-dae4a55f7017b5454065aa9294d8f4acb05dc9a9f87696479c461ce28b08dcc3 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.1 70B Instruct 2024 12
llama_model_loader: - kv   3:                            general.version str              = 2024-12
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Llama-3.1
llama_model_loader: - kv   6:                         general.size_label str              = 70B
llama_model_loader: - kv   7:                            general.license str              = llama3.1
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Llama 3.1 70B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Meta Llama
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv  12:                               general.tags arr[str,5]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv  13:                          general.languages arr[str,7]       = ["fr", "it", "pt", "hi", "es", "th", ...
llama_model_loader: - kv  14:                          llama.block_count u32              = 80
llama_model_loader: - kv  15:                       llama.context_length u32              = 131072
llama_model_loader: - kv  16:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv  17:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv  18:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv  19:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  20:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  21:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  22:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  23:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  24:                          general.file_type u32              = 12
llama_model_loader: - kv  25:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  162 tensors
llama_model_loader: - type q3_K:  321 tensors
llama_model_loader: - type q4_K:  155 tensors
llama_model_loader: - type q5_K:   85 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 31.91 GiB (3.88 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 8192
print_info: n_layer          = 80
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 28672
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 70B
print_info: model params     = 70.55 B
print_info: general.name     = Llama 3.1 70B Instruct 2024 12
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors:        CUDA0 model buffer size = 19299.00 MiB
load_tensors:        CUDA1 model buffer size = 12942.98 MiB
load_tensors:          CPU model buffer size =   430.55 MiB
[GIN] 2025/05/24 - 19:14:03 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/24 - 19:14:03 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.52 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 80, can_shift = 1, padding = 256
llama_kv_cache_unified:      CUDA0 KV buffer size =   784.00 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size =   496.00 MiB
llama_kv_cache_unified: KV self size  = 1280.00 MiB, K (f16):  640.00 MiB, V (f16):  640.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =   260.01 MiB
llama_context:      CUDA1 compute buffer size =   346.52 MiB
llama_context:  CUDA_Host compute buffer size =    48.02 MiB
llama_context: graph nodes  = 2407
llama_context: graph splits = 3
time=2025-05-24T19:14:12.757-05:00 level=INFO source=server.go:630 msg="llama runner started in 12.03 seconds"
[GIN] 2025/05/24 - 19:14:13 | 200 |   13.3879384s |       10.0.0.25 | POST     "/api/chat"

I have restarted the Ollama service and still get the same issue. The only thing I have not tried yet is rebooting the computer.
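If the mistral3 model will not fit in VRAM, one thing worth trying (a sketch, assuming Ollama's documented `options` field on `/api/generate`) is requesting a smaller context window per request, since the KV cache and compute-graph buffers scale with `num_ctx`. The value 2048 below is an example, not a recommendation:

```python
# Hypothetical workaround: build a request body that overrides the default
# context size for a single generation, then send it to a local server.
import json

payload = {
    "model": "mistral-small3.1:24b-instruct-2503",
    "prompt": "hello",
    "options": {"num_ctx": 2048},  # smaller context -> smaller KV/graph buffers
}
print(json.dumps(payload))
# Sent against a local Ollama server (default port 11434) with e.g.:
#   curl http://localhost:11434/api/generate -d '<the JSON above>'
```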


@MarkWard0110 commented on GitHub (May 25, 2025):

The Mistral memory handling is still not right. Ollama 0.7.1

llama3 70b at 39 GB is 100% GPU, but Mistral at 38 GB is 28% CPU?!?

NAME                            ID              SIZE     PROCESSOR    UNTIL
llama3.3:70b-instruct-q3_K_M    151348be3103    39 GB    100% GPU     Forever

NAME                                       ID              SIZE     PROCESSOR          UNTIL
mistral-small3.1:24b-instruct-2503-q8_0    79252b8a3eb5    38 GB    28%/72% CPU/GPU    Forever
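The PROCESSOR column above can also be derived programmatically. A minimal sketch, assuming the `size` and `size_vram` fields that Ollama's `GET /api/ps` endpoint returns per loaded model (the sample entry below is shaped like the report above, not a live response):

```python
# Sketch: reproduce the PROCESSOR split that `ollama ps` prints from the
# total model size and the portion resident in VRAM.

def processor_split(size: int, size_vram: int) -> str:
    """Return a '100% GPU' / '100% CPU' / 'NN%/MM% CPU/GPU' style summary."""
    if size_vram >= size:
        return "100% GPU"
    if size_vram == 0:
        return "100% CPU"
    cpu = round(100 * (size - size_vram) / size)
    return f"{cpu}%/{100 - cpu}% CPU/GPU"

# Sample data matching the mistral entry above (sizes in bytes, illustrative).
sample = {"name": "mistral-small3.1:24b-instruct-2503-q8_0",
          "size": 38_000_000_000, "size_vram": 27_360_000_000}
print(processor_split(sample["size"], sample["size_vram"]))  # 28%/72% CPU/GPU
```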

when mistral is loaded

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 576.52                 Driver Version: 576.52         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:01:00.0 Off |                  N/A |
|  0%   42C    P8              6W /  370W |   12180MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4070 ...  WDDM  |   00000000:08:00.0 Off |                  N/A |
|  0%   39C    P8              6W /  285W |   12139MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           40148      C   ...al\Programs\Ollama\ollama.exe      N/A      |
|    1   N/A  N/A           40148      C   ...al\Programs\Ollama\ollama.exe      N/A      |
+-----------------------------------------------------------------------------------------+

when llama is loaded

Sat May 24 19:24:35 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 576.52                 Driver Version: 576.52         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:01:00.0 Off |                  N/A |
|  0%   47C    P2            112W /  370W |   19554MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4070 ...  WDDM  |   00000000:08:00.0 Off |                  N/A |
|  0%   42C    P2             53W /  285W |   13159MiB /  16376MiB |     88%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           40156      C   ...al\Programs\Ollama\ollama.exe      N/A      |
|    1   N/A  N/A           40156      C   ...al\Programs\Ollama\ollama.exe      N/A      |
+-----------------------------------------------------------------------------------------+
@MarkWard0110 commented on GitHub (May 25, 2025):

It seems I can load the model at its MAX context size:

```
NAME                                         ID              SIZE     PROCESSOR          UNTIL
mistral-small3.1:24b-instruct-2503-q4_K_M    b9aaf0c2586a    61 GB    61%/39% CPU/GPU    Forever
```

But I can't load it at any other context size.
@MarkWard0110 commented on GitHub (Jun 5, 2025):

Mistral is still loading strangely with 0.9.0, this time with a 5090 + 3090 and the q8 model:

time=2025-06-05T17:53:35.530-05:00 level=INFO source=routes.go:1234 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:o:\\ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-06-05T17:53:35.536-05:00 level=INFO source=images.go:479 msg="total blobs: 129"
time=2025-06-05T17:53:35.538-05:00 level=INFO source=images.go:486 msg="total unused blobs removed: 0"
time=2025-06-05T17:53:35.539-05:00 level=INFO source=routes.go:1287 msg="Listening on [::]:11434 (version 0.9.0)"
time=2025-06-05T17:53:35.539-05:00 level=DEBUG source=sched.go:108 msg="starting llm scheduler"
time=2025-06-05T17:53:35.539-05:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-06-05T17:53:35.539-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-06-05T17:53:35.539-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-06-05T17:53:35.539-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-06-05T17:53:35.539-05:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-06-05T17:53:35.539-05:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-06-05T17:53:35.539-05:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files\\Volta\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvml.dll C:\\Program Files\\Amazon\\AWSCLIV2\\nvml.dll F:\\software\\terraform\\nvml.dll C:\\Program Files\\GitHub CLI\\nvml.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvml.dll C:\\Program Files\\PowerShell\\7\\nvml.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Volta\\bin\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin\\nvml.dll C:\\Users\\wardm\\.dotnet\\tools\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-06-05T17:53:35.540-05:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-06-05T17:53:35.540-05:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-06-05T17:53:35.550-05:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-06-05T17:53:35.550-05:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-06-05T17:53:35.550-05:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files\\Volta\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\Amazon\\AWSCLIV2\\nvcuda.dll F:\\software\\terraform\\nvcuda.dll C:\\Program Files\\GitHub CLI\\nvcuda.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvcuda.dll C:\\Program Files\\PowerShell\\7\\nvcuda.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Volta\\bin\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin\\nvcuda.dll C:\\Users\\wardm\\.dotnet\\tools\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-06-05T17:53:35.552-05:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-06-05T17:53:35.553-05:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FF9C2411F80
dlsym: cuDriverGetVersion - 00007FF9C2412020
dlsym: cuDeviceGetCount - 00007FF9C2412816
dlsym: cuDeviceGet - 00007FF9C2412810
dlsym: cuDeviceGetAttribute - 00007FF9C2412170
dlsym: cuDeviceGetUuid - 00007FF9C2412822
dlsym: cuDeviceGetName - 00007FF9C241281C
dlsym: cuCtxCreate_v3 - 00007FF9C2412894
dlsym: cuMemGetInfo_v2 - 00007FF9C2412996
dlsym: cuCtxDestroy - 00007FF9C24128A6
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 2
time=2025-06-05T17:53:35.564-05:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-32fda0b3-4602-83bb-0be7-24ef41847cda] CUDA totalMem 32606mb
[GPU-32fda0b3-4602-83bb-0be7-24ef41847cda] CUDA freeMem 30843mb
[GPU-32fda0b3-4602-83bb-0be7-24ef41847cda] Compute Capability 12.0
time=2025-06-05T17:53:35.688-05:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda library=cuda compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB"
[GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5] CUDA totalMem 24575mb
[GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5] CUDA freeMem 23306mb
[GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5] Compute Capability 8.6
time=2025-06-05T17:53:35.757-05:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 library=cuda compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB"
time=2025-06-05T17:53:35.758-05:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2025-06-05T17:53:35.758-05:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda library=cuda variant=v12 compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5090" total="31.8 GiB" available="30.1 GiB"
time=2025-06-05T17:53:35.759-05:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
time=2025-06-05T17:53:48.062-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:53:48.063-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="67.9 GiB" before.free_swap="65.5 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB"
time=2025-06-05T17:53:48.094-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:53:48.094-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:53:48.104-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:53:48.116-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:53:48.117-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:53:48.117-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:53:48.117-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB"
time=2025-06-05T17:53:48.157-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:53:48.157-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:53:48.159-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:53:48.159-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB"
time=2025-06-05T17:53:48.188-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:53:48.188-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:53:48.189-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:53:48.189-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB"
time=2025-06-05T17:53:48.220-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:53:48.220-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:53:48.220-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 library=cuda parallel=1 required="41.7 GiB"
time=2025-06-05T17:53:48.220-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB"
time=2025-06-05T17:53:48.251-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:53:48.251-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:53:48.252-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.0 GiB" free_swap="65.7 GiB"
time=2025-06-05T17:53:48.252-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:53:48.252-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB"
time=2025-06-05T17:53:48.279-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:53:48.279-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:53:48.280-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="41.7 GiB" memory.required.partial="41.7 GiB" memory.required.kv="3.1 GiB" memory.required.allocations="[25.9 GiB 15.8 GiB]" memory.weights.total="22.8 GiB" memory.weights.repeating="22.2 GiB" memory.weights.nonrepeating="680.0 MiB" memory.graph.full="2.1 GiB" memory.graph.partial="2.1 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-06-05T17:53:48.280-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-05T17:53:48.280-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-06-05T17:53:48.280-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:53:48.307-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-06-05T17:53:48.308-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-06-05T17:53:48.308-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 --ctx-size 20225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52855"
time=2025-06-05T17:53:48.308-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5
time=2025-06-05T17:53:48.310-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:53:48.310-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:53:48.312-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:53:48.337-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T17:53:48.337-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52855"
time=2025-06-05T17:53:48.356-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:53:48.357-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default=""
time=2025-06-05T17:53:48.357-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default=""
time=2025-06-05T17:53:48.357-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q8_0 name="" description="" num_tensors=585 num_key_values=43
time=2025-06-05T17:53:48.357-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-06-05T17:53:48.366-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-06-05T17:53:48.497-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-06-05T17:53:48.563-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T17:53:48.591-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="680.0 MiB"
time=2025-06-05T17:53:48.591-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="11.6 GiB"
time=2025-06-05T17:53:48.591-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="12.0 GiB"
time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:53:48.778-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1
time=2025-06-05T17:53:48.778-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-06-05T17:53:48.778-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:53:48.778-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T17:53:49.093-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1265 splits=4
time=2025-06-05T17:53:49.093-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="332.7 MiB"
time=2025-06-05T17:53:49.093-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:53:49.093-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 MiB"
time=2025-06-05T17:53:49.094-05:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=713031680A allocated.CPU.Graph=10485760A allocated.CUDA0.Weights="[595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Cache="[83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Graph=348839936A allocated.CUDA1.Weights="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 1591070720A]" allocated.CUDA1.Cache="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 0U]" allocated.CUDA1.Graph=9791055360A
time=2025-06-05T17:53:49.316-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.09"
time=2025-06-05T17:53:49.566-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.17"
time=2025-06-05T17:53:49.817-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.22"
time=2025-06-05T17:53:50.067-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.27"
time=2025-06-05T17:53:50.318-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.33"
time=2025-06-05T17:53:50.569-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.38"
time=2025-06-05T17:53:50.819-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.44"
time=2025-06-05T17:53:51.070-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.51"
time=2025-06-05T17:53:51.320-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.61"
time=2025-06-05T17:53:51.571-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.72"
time=2025-06-05T17:53:51.821-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.81"
time=2025-06-05T17:53:52.072-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.91"
time=2025-06-05T17:53:52.322-05:00 level=INFO source=server.go:630 msg="llama runner started in 4.01 seconds"
time=2025-06-05T17:53:52.322-05:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:53:52.346-05:00 level=DEBUG source=server.go:729 msg="completion request" images=0 prompt=1600 format=""
time=2025-06-05T17:53:52.360-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[1]
time=2025-06-05T17:53:52.360-05:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=365 used=0 remaining=365
[GIN] 2025/06/05 - 17:53:52 | 200 |    4.8473179s |       10.0.0.25 | POST     "/api/chat"
time=2025-06-05T17:53:52.897-05:00 level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-06-05T17:53:52.897-05:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 duration=2562047h47m16.854775807s
time=2025-06-05T17:53:52.897-05:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 refCount=0
[GIN] 2025/06/05 - 17:53:53 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:53:53 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:54:20.987-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:151 msg=reloading runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:287 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 refCount=0
time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:300 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="66.7 GiB" now.free_swap="27.0 GiB"
time=2025-06-05T17:54:21.014-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="15.9 GiB" now.used="14.7 GiB"
time=2025-06-05T17:54:21.014-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="17179869183.8 GiB" now.used="23.2 GiB"
releasing nvml library
time=2025-06-05T17:54:21.015-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=18128
time=2025-06-05T17:54:21.015-05:00 level=DEBUG source=server.go:1029 msg="waiting for llama server to exit" pid=18128
time=2025-06-05T17:54:21.239-05:00 level=DEBUG source=server.go:1033 msg="llama server stopped" pid=18128
time=2025-06-05T17:54:21.239-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.266-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="66.7 GiB" before.free_swap="27.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.302-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="15.9 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.302-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="17179869183.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.303-05:00 level=DEBUG source=sched.go:700 msg="gpu VRAM free memory converged after 0.31 seconds" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.303-05:00 level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.303-05:00 level=DEBUG source=sched.go:306 msg="unload completed" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.303-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.332-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.332-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.342-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:21.359-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:21.360-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.360-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:54:21.361-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.398-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.398-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.399-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:54:21.400-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.429-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.429-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.429-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:21.430-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.460-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.460-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.461-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 library=cuda parallel=1 required="46.1 GiB"
time=2025-06-05T17:54:21.461-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.487-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.487-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.488-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB"
time=2025-06-05T17:54:21.488-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:21.488-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.511-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.511-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.512-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="46.1 GiB" memory.required.partial="46.1 GiB" memory.required.kv="4.9 GiB" memory.required.allocations="[28.1 GiB 18.0 GiB]" memory.weights.total="22.8 GiB" memory.weights.repeating="22.2 GiB" memory.weights.nonrepeating="680.0 MiB" memory.graph.full="3.3 GiB" memory.graph.partial="3.3 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-06-05T17:54:21.512-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-05T17:54:21.512-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-06-05T17:54:21.512-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-06-05T17:54:21.536-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:21.542-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-06-05T17:54:21.542-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-06-05T17:54:21.542-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 --ctx-size 32225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52880"
time=2025-06-05T17:54:21.542-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5
time=2025-06-05T17:54:21.545-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:54:21.545-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:54:21.545-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:54:21.573-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T17:54:21.573-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52880"
time=2025-06-05T17:54:21.592-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:21.592-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default=""
time=2025-06-05T17:54:21.592-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default=""
time=2025-06-05T17:54:21.592-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q8_0 name="" description="" num_tensors=585 num_key_values=43
time=2025-06-05T17:54:21.593-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-06-05T17:54:21.601-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-06-05T17:54:21.729-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
[GIN] 2025/06/05 - 17:54:21 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:21 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:54:21.797-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T17:54:21.831-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="12.0 GiB"
time=2025-06-05T17:54:21.831-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="680.0 MiB"
time=2025-06-05T17:54:21.831-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="11.6 GiB"
time=2025-06-05T17:54:21.832-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:21.832-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:21.833-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:21.833-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:21.833-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:21.833-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:22.021-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1
time=2025-06-05T17:54:22.021-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-06-05T17:54:22.022-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:22.022-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 63.00 MiB on device 1: cudaMalloc failed: out of memory
panic: insufficient memory - required allocations: {InputWeights:713031680A CPU:{Name:CPU Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} GPUs:[{Name:CUDA0 Weights:[595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} {Name:CUDA1 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 1591070720A] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576F 0U 0U 0U] Graph:9791055360A}]}

goroutine 29 [running]:
github.com/ollama/ollama/ml/backend/ggml.(*Context).newTensor(0xc001088840, 0x1d0e3b13e20?, {0xc001086168, 0x3, 0x7ff7a21bbe02?})
	C:/a/ollama/ollama/ml/backend/ggml/ggml.go:714 +0x696
github.com/ollama/ollama/ml/backend/ggml.(*Context).Zeros(0x7ff7a32c6ec0?, 0xc001118840?, {0xc001086168?, 0xc000056508?, 0x7ff7a2205009?})
	C:/a/ollama/ollama/ml/backend/ggml/ggml.go:727 +0x1c
github.com/ollama/ollama/kvcache.(*Causal).Put(0xc001116960, {0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48, 0xc0011700d8}, {0x7ff7a35d6a48, 0xc001170120})
	C:/a/ollama/ollama/kvcache/causal.go:566 +0x4cc
github.com/ollama/ollama/ml/nn.Attention({0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48, 0xc001170078}, {0x7ff7a35d6a48, 0xc0011700d8}, {0x7ff7a35d6a48, 0xc001170120}, 0x3fb6a09e667f3bcc, {0x7ff7a35c9a00, ...})
	C:/a/ollama/ollama/ml/nn/attention.go:39 +0x1c3
github.com/ollama/ollama/model/models/mistral3.(*SelfAttention).Forward(0xc001173d60, {0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48, 0xc001170000}, {0x7ff7a35d6a48, 0xc00114a210}, {0x7ff7a35c9a00, 0xc001116960}, 0xc001146f00)
	C:/a/ollama/ollama/model/models/mistral3/model_text.go:52 +0x3f3
github.com/ollama/ollama/model/models/mistral3.(*Layer).Forward(0xc00004b8e0, {0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48, 0xc00116bf68}, {0x7ff7a35d6a48, 0xc00114a210}, {0x0, 0x0}, {0x7ff7a35c9a00, ...}, ...)
	C:/a/ollama/ollama/model/models/mistral3/model_text.go:83 +0xd9
github.com/ollama/ollama/model/models/mistral3.(*TextModel).Forward(0xc001144580, {0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48?, 0xc00114a1c8?}, {0x7ff7a35d6a48, 0xc00114a210}, {0x7ff7a35d6a48, 0xc00114a228}, {{0x7ff7a35d6a48, ...}, ...}, ...)
	C:/a/ollama/ollama/model/models/mistral3/model_text.go:117 +0x3aa
github.com/ollama/ollama/model/models/mistral3.(*Model).Forward(0xc00038f0a0, {0x7ff7a35ca8c8, 0xc0011dcb00}, {{0x7ff7a35d6a48, 0xc00114a1c8}, {0xc000140600, 0x9, 0x10}, {0xc000330800, 0x200, ...}, ...})
	C:/a/ollama/ollama/model/models/mistral3/model.go:164 +0x1d7
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc0006366c0)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:821 +0xac5
github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc0006366c0, {0xc0000de000?, 0x0?}, {0x8, 0x0, 0x29, {0xc00037fbf8, 0x2, 0x2}, 0x1}, ...)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc0006366c0, {0x7ff7a35c2540, 0xc000146410}, {0xc0000de000?, 0x0?}, {0x8, 0x0, 0x29, {0xc00037fbf8, 0x2, ...}, ...}, ...)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11
time=2025-06-05T17:54:22.512-05:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
time=2025-06-05T17:54:22.548-05:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory"
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=sched.go:491 msg="triggering expiration for failed load" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=32225
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=32225
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=32225
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=32225
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
[GIN] 2025/06/05 - 17:54:22 | 500 |    1.5788971s |       10.0.0.25 | POST     "/api/chat"
time=2025-06-05T17:54:22.584-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:22.584-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:22.584-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=41896
time=2025-06-05T17:54:22.584-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
[GIN] 2025/06/05 - 17:54:23 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:23 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:54:26.085-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.108-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.110-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.110-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.110-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.141-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.141-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.151-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.161-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.162-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:26.162-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:54:26.163-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.203-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.203-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.204-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:54:26.204-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
[GIN] 2025/06/05 - 17:54:26 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:26 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:54:26.235-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.235-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.235-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:26.235-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.264-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.264-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.265-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 library=cuda parallel=1 required="41.7 GiB"
time=2025-06-05T17:54:26.265-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.296-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.296-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.297-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB"
time=2025-06-05T17:54:26.297-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:26.297-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.327-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.327-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.328-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="41.7 GiB" memory.required.partial="41.7 GiB" memory.required.kv="3.1 GiB" memory.required.allocations="[25.9 GiB 15.8 GiB]" memory.weights.total="22.8 GiB" memory.weights.repeating="22.2 GiB" memory.weights.nonrepeating="680.0 MiB" memory.graph.full="2.1 GiB" memory.graph.partial="2.1 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-06-05T17:54:26.328-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-05T17:54:26.328-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-06-05T17:54:26.328-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-06-05T17:54:26.343-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:26.366-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-06-05T17:54:26.366-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-06-05T17:54:26.366-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 --ctx-size 20225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52886"
time=2025-06-05T17:54:26.366-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5
time=2025-06-05T17:54:26.370-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:54:26.370-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:54:26.372-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:54:26.373-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.373-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.398-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T17:54:26.398-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52886"
time=2025-06-05T17:54:26.416-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.418-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default=""
time=2025-06-05T17:54:26.418-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default=""
time=2025-06-05T17:54:26.418-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q8_0 name="" description="" num_tensors=585 num_key_values=43
time=2025-06-05T17:54:26.418-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-06-05T17:54:26.427-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-06-05T17:54:26.568-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-06-05T17:54:26.585-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="67.9 GiB" now.free_swap="52.6 GiB"
time=2025-06-05T17:54:26.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="519.2 MiB"
time=2025-06-05T17:54:26.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="21.2 GiB" now.used="1.8 GiB"
releasing nvml library
time=2025-06-05T17:54:26.611-05:00 level=DEBUG source=sched.go:700 msg="gpu VRAM free memory converged after 4.06 seconds" runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:26.611-05:00 level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:26.611-05:00 level=DEBUG source=sched.go:312 msg="ignoring unload event with no pending requests"
time=2025-06-05T17:54:26.623-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T17:54:26.657-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="12.0 GiB"
time=2025-06-05T17:54:26.657-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="680.0 MiB"
time=2025-06-05T17:54:26.657-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="11.6 GiB"
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:26.846-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1
time=2025-06-05T17:54:26.847-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-06-05T17:54:26.847-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:26.847-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T17:54:27.158-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1265 splits=4
time=2025-06-05T17:54:27.158-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="332.7 MiB"
time=2025-06-05T17:54:27.158-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:27.158-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 MiB"
time=2025-06-05T17:54:27.159-05:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=713031680A allocated.CPU.Graph=10485760A allocated.CUDA0.Weights="[595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Cache="[83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Graph=348839936A allocated.CUDA1.Weights="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 1591070720A]" allocated.CUDA1.Cache="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 0U]" allocated.CUDA1.Graph=9791055360A
time=2025-06-05T17:54:27.373-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.08"
time=2025-06-05T17:54:27.624-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.17"
time=2025-06-05T17:54:27.874-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.22"
time=2025-06-05T17:54:28.125-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.28"
time=2025-06-05T17:54:28.375-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.33"
[GIN] 2025/06/05 - 17:54:28 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:28 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:54:28.626-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.38"
time=2025-06-05T17:54:28.876-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.44"
time=2025-06-05T17:54:29.126-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.50"
time=2025-06-05T17:54:29.377-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.61"
time=2025-06-05T17:54:29.627-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.71"
time=2025-06-05T17:54:29.878-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.81"
time=2025-06-05T17:54:30.128-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.91"
time=2025-06-05T17:54:30.379-05:00 level=INFO source=server.go:630 msg="llama runner started in 4.01 seconds"
time=2025-06-05T17:54:30.379-05:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:30.401-05:00 level=DEBUG source=server.go:729 msg="completion request" images=0 prompt=1600 format=""
time=2025-06-05T17:54:30.415-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[1]
time=2025-06-05T17:54:30.415-05:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=365 used=0 remaining=365
[GIN] 2025/06/05 - 17:54:30 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:30 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:54:30 | 200 |    4.8496251s |       10.0.0.25 | POST     "/api/chat"
time=2025-06-05T17:54:30.945-05:00 level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-06-05T17:54:30.945-05:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 duration=2562047h47m16.854775807s
time=2025-06-05T17:54:30.945-05:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 refCount=0
[GIN] 2025/06/05 - 17:54:33 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:33 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:54:35 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:35 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:54:36.034-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:151 msg=reloading runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:287 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 refCount=0
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:300 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="67.9 GiB" before.free_swap="52.6 GiB" now.total="95.7 GiB" now.free="66.8 GiB" now.free_swap="27.1 GiB"
time=2025-06-05T17:54:36.072-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="15.9 GiB" now.used="14.7 GiB"
time=2025-06-05T17:54:36.072-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="21.2 GiB" now.total="24.0 GiB" now.free="17179869183.8 GiB" now.used="23.2 GiB"
releasing nvml library
time=2025-06-05T17:54:36.073-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=42340
time=2025-06-05T17:54:36.074-05:00 level=DEBUG source=server.go:1029 msg="waiting for llama server to exit" pid=42340
time=2025-06-05T17:54:36.283-05:00 level=DEBUG source=server.go:1033 msg="llama server stopped" pid=42340
time=2025-06-05T17:54:36.283-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.325-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="66.8 GiB" before.free_swap="27.1 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.359-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="15.9 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.359-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="17179869183.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.360-05:00 level=DEBUG source=sched.go:700 msg="gpu VRAM free memory converged after 0.32 seconds" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.360-05:00 level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.360-05:00 level=DEBUG source=sched.go:306 msg="unload completed" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.360-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.390-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.390-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.400-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:36.411-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:36.411-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.411-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:54:36.412-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.438-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.439-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.439-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:54:36.440-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.470-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.470-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.471-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:36.471-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.501-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.501-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.501-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 library=cuda parallel=1 required="50.1 GiB"
time=2025-06-05T17:54:36.501-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.535-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.535-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.535-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB"
time=2025-06-05T17:54:36.535-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:36.535-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.564-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.564-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.565-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="50.1 GiB" memory.required.partial="50.1 GiB" memory.required.kv="6.6 GiB" memory.required.allocations="[30.1 GiB 20.0 GiB]" memory.weights.total="22.8 GiB" memory.weights.repeating="22.2 GiB" memory.weights.nonrepeating="680.0 MiB" memory.graph.full="4.4 GiB" memory.graph.partial="4.4 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-06-05T17:54:36.565-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-05T17:54:36.565-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-06-05T17:54:36.565-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-06-05T17:54:36.596-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:36.603-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-06-05T17:54:36.603-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-06-05T17:54:36.603-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 --ctx-size 43225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52894"
time=2025-06-05T17:54:36.603-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5
time=2025-06-05T17:54:36.605-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:54:36.606-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:54:36.606-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:54:36.632-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T17:54:36.632-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52894"
time=2025-06-05T17:54:36.651-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:36.652-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default=""
time=2025-06-05T17:54:36.652-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default=""
time=2025-06-05T17:54:36.652-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q8_0 name="" description="" num_tensors=585 num_key_values=43
time=2025-06-05T17:54:36.652-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-06-05T17:54:36.660-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-06-05T17:54:36.788-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-06-05T17:54:36.857-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T17:54:36.884-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="12.0 GiB"
time=2025-06-05T17:54:36.884-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="680.0 MiB"
time=2025-06-05T17:54:36.884-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="11.6 GiB"
time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:37.076-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1
time=2025-06-05T17:54:37.076-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-06-05T17:54:37.076-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:37.076-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
[GIN] 2025/06/05 - 17:54:37 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:37 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 84.50 MiB on device 1: cudaMalloc failed: out of memory
panic: insufficient memory - required allocations: {InputWeights:713031680A CPU:{Name:CPU Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} GPUs:[{Name:CUDA0 Weights:[595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} {Name:CUDA1 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 1591070720A] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344F 0U 0U 0U 0U 0U 0U 0U] Graph:9791055360A}]}

goroutine 66 [running]:
github.com/ollama/ollama/ml/backend/ggml.(*Context).newTensor(0xc0002ea380, 0x1881be03720?, {0xc0012302b8, 0x3, 0x7ff7a21bbe02?})
	C:/a/ollama/ollama/ml/backend/ggml/ggml.go:714 +0x696
github.com/ollama/ollama/ml/backend/ggml.(*Context).Zeros(0x7ff7a32c6ec0?, 0xc001242ea0?, {0xc0012302b8?, 0xc00005f908?, 0x7ff7a2205009?})
	C:/a/ollama/ollama/ml/backend/ggml/ggml.go:727 +0x1c
github.com/ollama/ollama/kvcache.(*Causal).Put(0xc00122aa50, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48, 0xc0010bb1e8}, {0x7ff7a35d6a48, 0xc0010bb218})
	C:/a/ollama/ollama/kvcache/causal.go:566 +0x4cc
github.com/ollama/ollama/ml/nn.Attention({0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48, 0xc0010bb170}, {0x7ff7a35d6a48, 0xc0010bb1e8}, {0x7ff7a35d6a48, 0xc0010bb218}, 0x3fb6a09e667f3bcc, {0x7ff7a35c9a00, ...})
	C:/a/ollama/ollama/ml/nn/attention.go:39 +0x1c3
github.com/ollama/ollama/model/models/mistral3.(*SelfAttention).Forward(0xc00034ff40, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48, 0xc0010bb110}, {0x7ff7a35d6a48, 0xc000009dd0}, {0x7ff7a35c9a00, 0xc00122aa50}, 0xc0010aef00)
	C:/a/ollama/ollama/model/models/mistral3/model_text.go:52 +0x3f3
github.com/ollama/ollama/model/models/mistral3.(*Layer).Forward(0xc00004b8e0, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48, 0xc0010bb0f8}, {0x7ff7a35d6a48, 0xc000009dd0}, {0x0, 0x0}, {0x7ff7a35c9a00, ...}, ...)
	C:/a/ollama/ollama/model/models/mistral3/model_text.go:83 +0xd9
github.com/ollama/ollama/model/models/mistral3.(*TextModel).Forward(0xc00127f740, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48?, 0xc000009da0?}, {0x7ff7a35d6a48, 0xc000009dd0}, {0x7ff7a35d6a48, 0xc000009de8}, {{0x7ff7a35d6a48, ...}, ...}, ...)
	C:/a/ollama/ollama/model/models/mistral3/model_text.go:117 +0x3aa
github.com/ollama/ollama/model/models/mistral3.(*Model).Forward(0xc000454f50, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {{0x7ff7a35d6a48, 0xc000009da0}, {0xc000166600, 0x9, 0x10}, {0xc000e7a800, 0x200, ...}, ...})
	C:/a/ollama/ollama/model/models/mistral3/model.go:164 +0x1d7
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc00054e000)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:821 +0xac5
github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc00054e000, {0xc0000de000?, 0x0?}, {0x8, 0x0, 0x29, {0xc0003c6268, 0x2, 0x2}, 0x1}, ...)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc00054e000, {0x7ff7a35c2540, 0xc000550000}, {0xc0000de000?, 0x0?}, {0x8, 0x0, 0x29, {0xc0003c6268, 0x2, ...}, ...}, ...)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11
time=2025-06-05T17:54:37.641-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:54:37.669-05:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
time=2025-06-05T17:54:37.891-05:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory"
time=2025-06-05T17:54:37.891-05:00 level=DEBUG source=sched.go:491 msg="triggering expiration for failed load" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=43225
time=2025-06-05T17:54:37.891-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=43225
time=2025-06-05T17:54:37.891-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=43225
time=2025-06-05T17:54:37.891-05:00 level=DEBUG source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=43225
time=2025-06-05T17:54:37.891-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
[GIN] 2025/06/05 - 17:54:37 | 500 |    1.8841691s |       10.0.0.25 | POST     "/api/chat"
time=2025-06-05T17:54:37.917-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:37.917-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:37.917-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=24780
time=2025-06-05T17:54:37.917-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
[... repeated "updating system memory data" / "updating cuda memory data" / "releasing nvml library" polling cycles elided (17:54:38–17:54:39, same values each cycle) ...]
[GIN] 2025/06/05 - 17:54:39 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:39 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[... repeated polling cycles elided (17:54:39–17:54:41) ...]
[GIN] 2025/06/05 - 17:54:41 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:41 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[... repeated polling cycles elided (17:54:41–17:54:42) ...]
time=2025-06-05T17:54:42.918-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0263834 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:42.918-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:42.921-05:00 level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:42.921-05:00 level=DEBUG source=sched.go:312 msg="ignoring unload event with no pending requests"
time=2025-06-05T17:54:42.953-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:42.953-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.168-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2763602 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:43.168-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.202-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.202-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.418-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5263857 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:43.521-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.523-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
[GIN] 2025/06/05 - 17:54:43 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:43 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:54:43.622-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.632-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.633-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:54:43.633-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:54:43.633-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.673-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.673-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.674-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:54:43.674-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.705-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.705-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.706-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:43.707-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.737-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.737-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.737-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc library=cuda parallel=1 required="39.9 GiB"
time=2025-06-05T17:54:43.737-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.764-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.764-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.766-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB"
time=2025-06-05T17:54:43.766-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:43.766-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.795-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.795-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.796-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.9 GiB" memory.required.partial="39.9 GiB" memory.required.kv="6.6 GiB" memory.required.allocations="[24.9 GiB 15.0 GiB]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="4.4 GiB" memory.graph.partial="4.4 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-06-05T17:54:43.796-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-05T17:54:43.796-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-06-05T17:54:43.796-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-06-05T17:54:43.819-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:43.824-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-06-05T17:54:43.824-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-06-05T17:54:43.824-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc --ctx-size 43225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52903"
time=2025-06-05T17:54:43.824-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5
time=2025-06-05T17:54:43.827-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:54:43.827-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:54:43.828-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:54:43.852-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T17:54:43.853-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52903"
time=2025-06-05T17:54:43.871-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.872-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default=""
time=2025-06-05T17:54:43.872-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default=""
time=2025-06-05T17:54:43.873-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=43
time=2025-06-05T17:54:43.873-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-06-05T17:54:43.881-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-06-05T17:54:44.004-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-06-05T17:54:44.080-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T17:54:44.093-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="7.2 GiB"
time=2025-06-05T17:54:44.095-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-06-05T17:54:44.095-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="6.7 GiB"
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:44.284-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1
time=2025-06-05T17:54:44.284-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-06-05T17:54:44.284-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:44.284-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T17:54:44.604-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1265 splits=4
time=2025-06-05T17:54:44.604-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="421.7 MiB"
time=2025-06-05T17:54:44.604-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:44.604-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 MiB"
time=2025-06-05T17:54:44.605-05:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=550502400A allocated.CPU.Graph=10485760A allocated.CUDA0.Weights="[363438080A 363438080A 363438080A 363438080A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Cache="[177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Graph=442163200A allocated.CUDA1.Weights="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 363438080A 363438080A 363438080A 363438080A 363438080A 1255526400A]" allocated.CUDA1.Cache="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 0U]" allocated.CUDA1.Graph=9791055360A
time=2025-06-05T17:54:44.832-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.15"
time=2025-06-05T17:54:45.084-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.25"
time=2025-06-05T17:54:45.335-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.34"
time=2025-06-05T17:54:45.586-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.44"
[GIN] 2025/06/05 - 17:54:45 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:45 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:54:45.836-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.56"
time=2025-06-05T17:54:46.086-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.75"
time=2025-06-05T17:54:46.337-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.90"
time=2025-06-05T17:54:46.587-05:00 level=INFO source=server.go:630 msg="llama runner started in 2.76 seconds"
time=2025-06-05T17:54:46.587-05:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:54:46.611-05:00 level=DEBUG source=server.go:729 msg="completion request" images=0 prompt=1600 format=""
time=2025-06-05T17:54:46.624-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[1]
time=2025-06-05T17:54:46.624-05:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=365 used=0 remaining=365
[GIN] 2025/06/05 - 17:54:47 | 200 |    3.6897465s |       10.0.0.25 | POST     "/api/chat"
time=2025-06-05T17:54:47.147-05:00 level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-06-05T17:54:47.147-05:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225 duration=2562047h47m16.854775807s
time=2025-06-05T17:54:47.147-05:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225 refCount=0
[GIN] 2025/06/05 - 17:54:47 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:47 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:54:50 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:50 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:54:52 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:52 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:54:54 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:54 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:54:56 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:56 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:54:59 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:54:59 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:55:01 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:55:01 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:55:02.357-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:151 msg=reloading runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:287 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225 refCount=0
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:300 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="67.0 GiB" now.free_swap="33.4 GiB"
time=2025-06-05T17:55:02.388-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="18.9 GiB" now.used="11.7 GiB"
time=2025-06-05T17:55:02.388-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="2.9 GiB" now.used="20.1 GiB"
releasing nvml library
time=2025-06-05T17:55:02.388-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=9872
time=2025-06-05T17:55:02.389-05:00 level=DEBUG source=server.go:1029 msg="waiting for llama server to exit" pid=9872
time=2025-06-05T17:55:02.585-05:00 level=DEBUG source=server.go:1033 msg="llama server stopped" pid=9872
time=2025-06-05T17:55:02.585-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:02.639-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="67.0 GiB" before.free_swap="33.4 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:55:02.676-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="18.9 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:02.676-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="2.9 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:02.888-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:55:02.921-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:02.921-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:03.138-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:03.172-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:03.172-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:03.389-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:55:03.419-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:03.419-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
[GIN] 2025/06/05 - 17:55:03 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:55:03 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:55:05 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:55:05 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:55:05.889-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:05.921-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:05.921-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:06.139-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:55:06.171-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:06.171-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:06.388-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:06.419-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:06.419-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:06.638-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:06.672-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:06.672-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:06.889-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:55:06.922-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:06.922-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.138-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:07.176-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:07.176-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.388-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0305489 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:07.388-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:07.391-05:00 level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:07.391-05:00 level=DEBUG source=sched.go:306 msg="unload completed" runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:07.422-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:07.422-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.423-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:07.454-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:07.454-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.464-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:55:07.483-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:55:07.485-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:07.485-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:55:07.485-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:07.516-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:07.516-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.517-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:55:07.517-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:07.547-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:07.547-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.548-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:55:07.548-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB"
time=2025-06-05T17:55:07.578-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:07.578-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.578-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc library=cuda parallel=1 required="50.4 GiB"
time=2025-06-05T17:55:07.578-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:55:07.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:07.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.611-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB"
time=2025-06-05T17:55:07.611-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:55:07.611-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:55:07.638-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2805774 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:07.645-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:07.645-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.646-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="50.4 GiB" memory.required.partial="50.4 GiB" memory.required.kv="11.0 GiB" memory.required.allocations="[30.1 GiB 20.3 GiB]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="7.3 GiB" memory.graph.partial="7.3 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-06-05T17:55:07.646-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-05T17:55:07.646-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-06-05T17:55:07.646-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-06-05T17:55:07.646-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:55:07.669-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-06-05T17:55:07.675-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc --ctx-size 72112 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52916"
time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5
time=2025-06-05T17:55:07.678-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:55:07.678-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:55:07.678-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:55:07.704-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T17:55:07.704-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52916"
time=2025-06-05T17:55:07.722-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:55:07.724-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default=""
time=2025-06-05T17:55:07.724-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default=""
time=2025-06-05T17:55:07.724-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=43
time=2025-06-05T17:55:07.724-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-06-05T17:55:07.730-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-06-05T17:55:07.876-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-06-05T17:55:07.889-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5307762 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:07.932-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T17:55:07.964-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="7.2 GiB"
time=2025-06-05T17:55:07.964-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-06-05T17:55:07.964-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="6.7 GiB"
time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
[GIN] 2025/06/05 - 17:55:08 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:55:08 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:55:08.153-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1
time=2025-06-05T17:55:08.153-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-06-05T17:55:08.153-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:55:08.153-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T17:55:08.598-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1265 splits=4
time=2025-06-05T17:55:08.598-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="534.7 MiB"
time=2025-06-05T17:55:08.598-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:55:08.598-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 MiB"
time=2025-06-05T17:55:08.599-05:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=550502400A allocated.CPU.Graph=10485760A allocated.CUDA0.Weights="[363438080A 363438080A 363438080A 363438080A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Cache="[295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Graph=560652288A allocated.CUDA1.Weights="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 363438080A 363438080A 363438080A 363438080A 363438080A 1255526400A]" allocated.CUDA1.Cache="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 0U]" allocated.CUDA1.Graph=9791055360A
time=2025-06-05T17:55:08.684-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.06"
time=2025-06-05T17:55:08.934-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.20"
time=2025-06-05T17:55:09.185-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.30"
time=2025-06-05T17:55:09.436-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.39"
time=2025-06-05T17:55:09.687-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.48"
time=2025-06-05T17:55:09.937-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.66"
[GIN] 2025/06/05 - 17:55:10 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:55:10 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:55:10.188-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.83"
time=2025-06-05T17:55:10.438-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.98"
time=2025-06-05T17:55:10.689-05:00 level=INFO source=server.go:630 msg="llama runner started in 3.01 seconds"
time=2025-06-05T17:55:10.689-05:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="50.4 GiB" runner.vram="50.4 GiB" runner.parallel=1 runner.pid=38020 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=72112
time=2025-06-05T17:55:10.711-05:00 level=DEBUG source=server.go:729 msg="completion request" images=0 prompt=1600 format=""
time=2025-06-05T17:55:10.726-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[1]
time=2025-06-05T17:55:10.727-05:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=365 used=0 remaining=365
[GIN] 2025/06/05 - 17:55:11 | 200 |    8.8926139s |       10.0.0.25 | POST     "/api/chat"
time=2025-06-05T17:55:11.238-05:00 level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-06-05T17:55:11.238-05:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="50.4 GiB" runner.vram="50.4 GiB" runner.parallel=1 runner.pid=38020 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=72112 duration=2562047h47m16.854775807s
time=2025-06-05T17:55:11.238-05:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="50.4 GiB" runner.vram="50.4 GiB" runner.parallel=1 runner.pid=38020 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=72112 refCount=0
[GIN] 2025/06/05 - 17:55:12 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:55:12 | 200 |       512.5µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:55:14 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:55:14 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:55:16 | 200 |       511.8µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:55:16 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/06/05 - 17:55:18 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:55:18 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=sched.go:322 msg="shutting down scheduler completed loop"
time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=sched.go:122 msg="shutting down scheduler pending loop"
time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=sched.go:872 msg="shutting down runner" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=38020
time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=server.go:1029 msg="waiting for llama server to exit" pid=38020
time=2025-06-05T17:55:19.266-05:00 level=DEBUG source=server.go:1033 msg="llama server stopped" pid=38020

<!-- gh-comment-id:2946760354 --> @MarkWard0110 commented on GitHub (Jun 5, 2025): Mistral is still loading weirdly with 0.9.0, this time with a 5090 + 3090 and the q8 model:

```
time=2025-06-05T17:53:35.530-05:00 level=INFO source=routes.go:1234 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:o:\\ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-06-05T17:53:35.536-05:00 level=INFO source=images.go:479 msg="total blobs: 129"
time=2025-06-05T17:53:35.538-05:00 level=INFO source=images.go:486 msg="total unused blobs removed: 0"
time=2025-06-05T17:53:35.539-05:00 level=INFO source=routes.go:1287 msg="Listening on [::]:11434 (version 0.9.0)"
time=2025-06-05T17:53:35.539-05:00 level=DEBUG source=sched.go:108 msg="starting llm scheduler"
time=2025-06-05T17:53:35.539-05:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-06-05T17:53:35.539-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-06-05T17:53:35.539-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-06-05T17:53:35.539-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-06-05T17:53:35.539-05:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-06-05T17:53:35.539-05:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-06-05T17:53:35.539-05:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files\\Volta\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvml.dll C:\\Program Files\\Amazon\\AWSCLIV2\\nvml.dll F:\\software\\terraform\\nvml.dll C:\\Program Files\\GitHub CLI\\nvml.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvml.dll C:\\Program Files\\PowerShell\\7\\nvml.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Volta\\bin\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin\\nvml.dll C:\\Users\\wardm\\.dotnet\\tools\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-06-05T17:53:35.540-05:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-06-05T17:53:35.540-05:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-06-05T17:53:35.550-05:00 level=DEBUG source=gpu.go:111 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-06-05T17:53:35.550-05:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvcuda.dll
time=2025-06-05T17:53:35.550-05:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files\\Volta\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\Amazon\\AWSCLIV2\\nvcuda.dll F:\\software\\terraform\\nvcuda.dll C:\\Program Files\\GitHub CLI\\nvcuda.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvcuda.dll C:\\Program Files\\PowerShell\\7\\nvcuda.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Volta\\bin\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\nvcuda.dll 
C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin\\nvcuda.dll C:\\Users\\wardm\\.dotnet\\tools\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]" time=2025-06-05T17:53:35.552-05:00 level=DEBUG source=gpu.go:529 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll" time=2025-06-05T17:53:35.553-05:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll] initializing C:\Windows\system32\nvcuda.dll dlsym: cuInit - 00007FF9C2411F80 dlsym: cuDriverGetVersion - 00007FF9C2412020 dlsym: cuDeviceGetCount - 00007FF9C2412816 dlsym: cuDeviceGet - 00007FF9C2412810 dlsym: cuDeviceGetAttribute - 00007FF9C2412170 dlsym: cuDeviceGetUuid - 00007FF9C2412822 dlsym: cuDeviceGetName - 00007FF9C241281C dlsym: cuCtxCreate_v3 - 00007FF9C2412894 dlsym: cuMemGetInfo_v2 - 00007FF9C2412996 dlsym: cuCtxDestroy - 00007FF9C24128A6 calling cuInit calling cuDriverGetVersion raw version 0x2f3a CUDA driver version: 12.9 calling cuDeviceGetCount device count 2 time=2025-06-05T17:53:35.564-05:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll [GPU-32fda0b3-4602-83bb-0be7-24ef41847cda] CUDA totalMem 32606mb [GPU-32fda0b3-4602-83bb-0be7-24ef41847cda] CUDA freeMem 30843mb [GPU-32fda0b3-4602-83bb-0be7-24ef41847cda] Compute Capability 12.0 time=2025-06-05T17:53:35.688-05:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda library=cuda compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" [GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5] CUDA totalMem 24575mb [GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5] CUDA freeMem 23306mb [GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5] Compute 
Capability 8.6 time=2025-06-05T17:53:35.757-05:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 library=cuda compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" time=2025-06-05T17:53:35.758-05:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found." releasing cuda driver library releasing nvml library time=2025-06-05T17:53:35.758-05:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda library=cuda variant=v12 compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5090" total="31.8 GiB" available="30.1 GiB" time=2025-06-05T17:53:35.759-05:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB" time=2025-06-05T17:53:48.062-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:53:48.063-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="67.9 GiB" before.free_swap="65.5 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB" time=2025-06-05T17:53:48.094-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:53:48.094-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library 
time=2025-06-05T17:53:48.104-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:53:48.116-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:53:48.117-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 time=2025-06-05T17:53:48.117-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]" time=2025-06-05T17:53:48.117-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB" time=2025-06-05T17:53:48.157-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:53:48.157-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:53:48.159-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]" time=2025-06-05T17:53:48.159-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB" time=2025-06-05T17:53:48.188-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" 
before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:53:48.188-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:53:48.189-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]" time=2025-06-05T17:53:48.189-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB" time=2025-06-05T17:53:48.220-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:53:48.220-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:53:48.220-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 library=cuda parallel=1 required="41.7 GiB" time=2025-06-05T17:53:48.220-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB" time=2025-06-05T17:53:48.251-05:00 level=DEBUG 
source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:53:48.251-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:53:48.252-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.0 GiB" free_swap="65.7 GiB" time=2025-06-05T17:53:48.252-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]" time=2025-06-05T17:53:48.252-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="68.0 GiB" now.free_swap="65.7 GiB" time=2025-06-05T17:53:48.279-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:53:48.279-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:53:48.280-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="41.7 GiB" 
memory.required.partial="41.7 GiB" memory.required.kv="3.1 GiB" memory.required.allocations="[25.9 GiB 15.8 GiB]" memory.weights.total="22.8 GiB" memory.weights.repeating="22.2 GiB" memory.weights.nonrepeating="680.0 MiB" memory.graph.full="2.1 GiB" memory.graph.partial="2.1 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB" time=2025-06-05T17:53:48.280-05:00 level=INFO source=server.go:211 msg="enabling flash attention" time=2025-06-05T17:53:48.280-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type="" time=2025-06-05T17:53:48.280-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]" time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1 time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06 time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540 time=2025-06-05T17:53:48.303-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06 time=2025-06-05T17:53:48.307-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" 
path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 time=2025-06-05T17:53:48.308-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12] time=2025-06-05T17:53:48.308-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 --ctx-size 20225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52855" time=2025-06-05T17:53:48.308-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance 
Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 time=2025-06-05T17:53:48.310-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1 time=2025-06-05T17:53:48.310-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding" time=2025-06-05T17:53:48.312-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error" time=2025-06-05T17:53:48.337-05:00 level=INFO source=runner.go:925 msg="starting ollama engine" time=2025-06-05T17:53:48.337-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52855" time=2025-06-05T17:53:48.356-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:53:48.357-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default="" time=2025-06-05T17:53:48.357-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default="" time=2025-06-05T17:53:48.357-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q8_0 name="" description="" num_tensors=585 num_key_values=43 time=2025-06-05T17:53:48.357-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" 
path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll time=2025-06-05T17:53:48.366-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll time=2025-06-05T17:53:48.497-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2025-06-05T17:53:48.563-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" time=2025-06-05T17:53:48.591-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="680.0 MiB" time=2025-06-05T17:53:48.591-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="11.6 GiB" time=2025-06-05T17:53:48.591-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="12.0 GiB" time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| 
?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1 time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06 time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540 time=2025-06-05T17:53:48.591-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06 time=2025-06-05T17:53:48.778-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1 time=2025-06-05T17:53:48.778-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B" time=2025-06-05T17:53:48.778-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB" time=2025-06-05T17:53:48.778-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B" time=2025-06-05T17:53:49.093-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1265 splits=4 time=2025-06-05T17:53:49.093-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="332.7 MiB" time=2025-06-05T17:53:49.093-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB" time=2025-06-05T17:53:49.093-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 MiB" time=2025-06-05T17:53:49.094-05:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=713031680A allocated.CPU.Graph=10485760A allocated.CUDA0.Weights="[595435520A 595435520A 595435520A 595435520A 
595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Cache="[83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Graph=348839936A allocated.CUDA1.Weights="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 1591070720A]" allocated.CUDA1.Cache="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 0U]" allocated.CUDA1.Graph=9791055360A time=2025-06-05T17:53:49.316-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.09" time=2025-06-05T17:53:49.566-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.17" time=2025-06-05T17:53:49.817-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.22" time=2025-06-05T17:53:50.067-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.27" time=2025-06-05T17:53:50.318-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.33" time=2025-06-05T17:53:50.569-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.38" time=2025-06-05T17:53:50.819-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.44" time=2025-06-05T17:53:51.070-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.51" 
time=2025-06-05T17:53:51.320-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.61" time=2025-06-05T17:53:51.571-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.72" time=2025-06-05T17:53:51.821-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.81" time=2025-06-05T17:53:52.072-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.91" time=2025-06-05T17:53:52.322-05:00 level=INFO source=server.go:630 msg="llama runner started in 4.01 seconds" time=2025-06-05T17:53:52.322-05:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 time=2025-06-05T17:53:52.346-05:00 level=DEBUG source=server.go:729 msg="completion request" images=0 prompt=1600 format="" time=2025-06-05T17:53:52.360-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[1] time=2025-06-05T17:53:52.360-05:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=365 used=0 remaining=365 [GIN] 2025/06/05 - 17:53:52 | 200 | 4.8473179s | 10.0.0.25 | POST "/api/chat" time=2025-06-05T17:53:52.897-05:00 level=DEBUG source=sched.go:503 msg="context for request finished" time=2025-06-05T17:53:52.897-05:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 duration=2562047h47m16.854775807s 
time=2025-06-05T17:53:52.897-05:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 refCount=0 [GIN] 2025/06/05 - 17:53:53 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:53:53 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:53:55 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:53:55 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:53:57 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:53:57 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:54:00 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:00 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:54:02 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:02 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:54:04 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:04 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:54:06 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:06 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:54:08 | 200 | 538µs | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:08 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:54:10 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:10 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:54:12 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:12 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:54:15 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:15 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:54:17 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:17 | 200 | 0s | 127.0.0.1 | GET 
"/api/ps" [GIN] 2025/06/05 - 17:54:19 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:19 | 200 | 0s | 127.0.0.1 | GET "/api/ps" time=2025-06-05T17:54:20.987-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:151 msg=reloading runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:287 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 refCount=0 time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:300 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda 
runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 time=2025-06-05T17:54:20.988-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.0 GiB" before.free_swap="65.7 GiB" now.total="95.7 GiB" now.free="66.7 GiB" now.free_swap="27.0 GiB" time=2025-06-05T17:54:21.014-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="15.9 GiB" now.used="14.7 GiB" time=2025-06-05T17:54:21.014-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="17179869183.8 GiB" now.used="23.2 GiB" releasing nvml library 
time=2025-06-05T17:54:21.015-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=18128
time=2025-06-05T17:54:21.015-05:00 level=DEBUG source=server.go:1029 msg="waiting for llama server to exit" pid=18128
time=2025-06-05T17:54:21.239-05:00 level=DEBUG source=server.go:1033 msg="llama server stopped" pid=18128
time=2025-06-05T17:54:21.239-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.266-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="66.7 GiB" before.free_swap="27.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.302-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="15.9 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.302-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="17179869183.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.303-05:00 level=DEBUG source=sched.go:700 msg="gpu VRAM free memory converged after 0.31 seconds" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.303-05:00 level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.303-05:00 level=DEBUG source=sched.go:306 msg="unload completed" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=18128 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.303-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.332-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.332-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.342-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:21.359-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:21.360-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:21.360-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:54:21.361-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.398-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.398-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.399-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:54:21.400-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.429-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.429-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.429-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:21.430-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.460-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.460-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.461-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 library=cuda parallel=1 required="46.1 GiB"
time=2025-06-05T17:54:21.461-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.487-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.487-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.488-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB"
time=2025-06-05T17:54:21.488-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:21.488-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:21.511-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:21.511-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:21.512-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="46.1 GiB" memory.required.partial="46.1 GiB" memory.required.kv="4.9 GiB" memory.required.allocations="[28.1 GiB 18.0 GiB]" memory.weights.total="22.8 GiB" memory.weights.repeating="22.2 GiB" memory.weights.nonrepeating="680.0 MiB" memory.graph.full="3.3 GiB" memory.graph.partial="3.3 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-06-05T17:54:21.512-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-05T17:54:21.512-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-06-05T17:54:21.512-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-06-05T17:54:21.536-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:21.538-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:21.542-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-06-05T17:54:21.542-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-06-05T17:54:21.542-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 --ctx-size 32225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52880"
time=2025-06-05T17:54:21.542-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5
time=2025-06-05T17:54:21.545-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:54:21.545-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:54:21.545-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:54:21.573-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T17:54:21.573-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52880"
time=2025-06-05T17:54:21.592-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:21.592-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default=""
time=2025-06-05T17:54:21.592-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default=""
time=2025-06-05T17:54:21.592-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q8_0 name="" description="" num_tensors=585 num_key_values=43
time=2025-06-05T17:54:21.593-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-06-05T17:54:21.601-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-06-05T17:54:21.729-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
[GIN] 2025/06/05 - 17:54:21 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:21 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-06-05T17:54:21.797-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T17:54:21.831-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="12.0 GiB"
time=2025-06-05T17:54:21.831-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="680.0 MiB"
time=2025-06-05T17:54:21.831-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="11.6 GiB"
time=2025-06-05T17:54:21.832-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:21.832-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:21.833-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:21.833-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:21.833-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:21.833-05:00 level=DEBUG source=ggml.go:155 msg="key not found"
key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:22.021-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1
time=2025-06-05T17:54:22.021-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-06-05T17:54:22.022-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:22.022-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 63.00 MiB on device 1: cudaMalloc failed: out of memory
panic: insufficient memory - required allocations: {InputWeights:713031680A CPU:{Name:CPU Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} GPUs:[{Name:CUDA0 Weights:[595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} {Name:CUDA1 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 1591070720A] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576A 132120576F 0U 0U 0U] Graph:9791055360A}]}
goroutine 29 [running]:
github.com/ollama/ollama/ml/backend/ggml.(*Context).newTensor(0xc001088840, 0x1d0e3b13e20?, {0xc001086168, 0x3, 0x7ff7a21bbe02?})
    C:/a/ollama/ollama/ml/backend/ggml/ggml.go:714 +0x696
github.com/ollama/ollama/ml/backend/ggml.(*Context).Zeros(0x7ff7a32c6ec0?, 0xc001118840?, {0xc001086168?, 0xc000056508?, 0x7ff7a2205009?})
    C:/a/ollama/ollama/ml/backend/ggml/ggml.go:727 +0x1c
github.com/ollama/ollama/kvcache.(*Causal).Put(0xc001116960, {0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48, 0xc0011700d8}, {0x7ff7a35d6a48, 0xc001170120})
    C:/a/ollama/ollama/kvcache/causal.go:566 +0x4cc
github.com/ollama/ollama/ml/nn.Attention({0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48, 0xc001170078}, {0x7ff7a35d6a48, 0xc0011700d8}, {0x7ff7a35d6a48, 0xc001170120}, 0x3fb6a09e667f3bcc, {0x7ff7a35c9a00, ...})
    C:/a/ollama/ollama/ml/nn/attention.go:39 +0x1c3
github.com/ollama/ollama/model/models/mistral3.(*SelfAttention).Forward(0xc001173d60, {0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48, 0xc001170000}, {0x7ff7a35d6a48, 0xc00114a210}, {0x7ff7a35c9a00, 0xc001116960}, 0xc001146f00)
    C:/a/ollama/ollama/model/models/mistral3/model_text.go:52 +0x3f3
github.com/ollama/ollama/model/models/mistral3.(*Layer).Forward(0xc00004b8e0, {0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48, 0xc00116bf68}, {0x7ff7a35d6a48, 0xc00114a210}, {0x0, 0x0}, {0x7ff7a35c9a00, ...}, ...)
    C:/a/ollama/ollama/model/models/mistral3/model_text.go:83 +0xd9
github.com/ollama/ollama/model/models/mistral3.(*TextModel).Forward(0xc001144580, {0x7ff7a35ca8c8, 0xc0011dcb00}, {0x7ff7a35d6a48?, 0xc00114a1c8?}, {0x7ff7a35d6a48, 0xc00114a210}, {0x7ff7a35d6a48, 0xc00114a228}, {{0x7ff7a35d6a48, ...}, ...}, ...)
    C:/a/ollama/ollama/model/models/mistral3/model_text.go:117 +0x3aa
github.com/ollama/ollama/model/models/mistral3.(*Model).Forward(0xc00038f0a0, {0x7ff7a35ca8c8, 0xc0011dcb00}, {{0x7ff7a35d6a48, 0xc00114a1c8}, {0xc000140600, 0x9, 0x10}, {0xc000330800, 0x200, ...}, ...})
    C:/a/ollama/ollama/model/models/mistral3/model.go:164 +0x1d7
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc0006366c0)
    C:/a/ollama/ollama/runner/ollamarunner/runner.go:821 +0xac5
github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc0006366c0, {0xc0000de000?, 0x0?}, {0x8, 0x0, 0x29, {0xc00037fbf8, 0x2, 0x2}, 0x1}, ...)
    C:/a/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc0006366c0, {0x7ff7a35c2540, 0xc000146410}, {0xc0000de000?, 0x0?}, {0x8, 0x0, 0x29, {0xc00037fbf8, 0x2, ...}, ...}, ...)
    C:/a/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
    C:/a/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11
time=2025-06-05T17:54:22.512-05:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
time=2025-06-05T17:54:22.548-05:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory"
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=sched.go:491 msg="triggering expiration for failed load" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=32225
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0
runner.inference=cuda runner.devices=2 runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=32225
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=32225
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=32225
time=2025-06-05T17:54:22.548-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
[GIN] 2025/06/05 - 17:54:22 | 500 | 1.5788971s | 10.0.0.25 | POST "/api/chat"
time=2025-06-05T17:54:22.584-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:22.584-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:22.584-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=41896
time=2025-06-05T17:54:22.584-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:22.835-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:22.873-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:22.873-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:23.084-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.1 GiB" now.free_swap="65.8 GiB"
time=2025-06-05T17:54:23.115-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:23.115-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:23.334-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.1 GiB" before.free_swap="65.8 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:23.367-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:23.367-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:23.584-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:23.602-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:23.602-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:23.835-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:23.862-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:23.862-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
[GIN] 2025/06/05 - 17:54:23 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:23 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-06-05T17:54:24.085-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:24.123-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:24.123-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:24.337-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:24.361-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:24.361-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:24.585-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:24.614-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:24.614-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:24.835-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:24.867-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:24.867-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:25.085-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:25.111-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:25.111-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:25.335-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:25.373-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:25.373-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:25.585-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:25.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:25.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:25.835-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:25.866-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:25.866-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.085-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.108-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.110-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.110-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.110-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.141-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.141-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.151-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.161-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.162-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:26.162-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:54:26.163-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.203-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.203-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.204-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:54:26.204-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
[GIN] 2025/06/05 - 17:54:26 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:26 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-06-05T17:54:26.235-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.235-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.235-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:26.235-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.264-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.264-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.265-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 library=cuda parallel=1 required="41.7 GiB"
time=2025-06-05T17:54:26.265-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.296-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.296-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.297-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB"
time=2025-06-05T17:54:26.297-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:26.297-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.327-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.327-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.328-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="41.7 GiB" memory.required.partial="41.7 GiB" memory.required.kv="3.1 GiB" memory.required.allocations="[25.9 GiB 15.8 GiB]" memory.weights.total="22.8 GiB" memory.weights.repeating="22.2 GiB" memory.weights.nonrepeating="680.0 MiB" memory.graph.full="2.1 GiB" memory.graph.partial="2.1 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-06-05T17:54:26.328-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-05T17:54:26.328-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-06-05T17:54:26.328-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-06-05T17:54:26.343-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:26.364-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:26.366-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-06-05T17:54:26.366-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-06-05T17:54:26.366-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 --ctx-size 20225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52886"
time=2025-06-05T17:54:26.366-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5
time=2025-06-05T17:54:26.370-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:54:26.370-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:54:26.372-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:54:26.373-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:26.373-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:26.398-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T17:54:26.398-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52886"
time=2025-06-05T17:54:26.416-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:26.418-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default=""
time=2025-06-05T17:54:26.418-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default=""
time=2025-06-05T17:54:26.418-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q8_0 name="" description="" num_tensors=585 num_key_values=43
time=2025-06-05T17:54:26.418-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-06-05T17:54:26.427-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-06-05T17:54:26.568-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-06-05T17:54:26.585-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="67.9 GiB" now.free_swap="52.6 GiB"
time=2025-06-05T17:54:26.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="519.2 MiB"
time=2025-06-05T17:54:26.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="21.2 GiB" now.used="1.8 GiB"
releasing nvml library
time=2025-06-05T17:54:26.611-05:00 level=DEBUG source=sched.go:700 msg="gpu VRAM free memory converged after 4.06 seconds" runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:26.611-05:00 level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="46.1 GiB" runner.vram="46.1 GiB" runner.parallel=1 runner.pid=41896 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:26.611-05:00 level=DEBUG source=sched.go:312 msg="ignoring unload event with no pending requests"
time=2025-06-05T17:54:26.623-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T17:54:26.657-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="12.0 GiB"
time=2025-06-05T17:54:26.657-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="680.0 MiB"
time=2025-06-05T17:54:26.657-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="11.6 GiB"
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:26.659-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:26.846-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1
time=2025-06-05T17:54:26.847-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-06-05T17:54:26.847-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:26.847-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T17:54:27.158-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1265 splits=4
time=2025-06-05T17:54:27.158-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="332.7 MiB"
time=2025-06-05T17:54:27.158-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:27.158-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 MiB"
time=2025-06-05T17:54:27.159-05:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=713031680A allocated.CPU.Graph=10485760A allocated.CUDA0.Weights="[595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Cache="[83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Graph=348839936A allocated.CUDA1.Weights="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 1591070720A]" allocated.CUDA1.Cache="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 83886080A 0U]" allocated.CUDA1.Graph=9791055360A
time=2025-06-05T17:54:27.373-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.08"
time=2025-06-05T17:54:27.624-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.17"
time=2025-06-05T17:54:27.874-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.22"
time=2025-06-05T17:54:28.125-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.28"
time=2025-06-05T17:54:28.375-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.33"
[GIN] 2025/06/05 - 17:54:28 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:28 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-06-05T17:54:28.626-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.38"
time=2025-06-05T17:54:28.876-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.44"
time=2025-06-05T17:54:29.126-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.50"
time=2025-06-05T17:54:29.377-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.61"
time=2025-06-05T17:54:29.627-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.71"
time=2025-06-05T17:54:29.878-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.81"
time=2025-06-05T17:54:30.128-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.91"
time=2025-06-05T17:54:30.379-05:00 level=INFO source=server.go:630 msg="llama runner started in 4.01 seconds"
time=2025-06-05T17:54:30.379-05:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:30.401-05:00 level=DEBUG source=server.go:729 msg="completion request" images=0 prompt=1600 format=""
time=2025-06-05T17:54:30.415-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[1]
time=2025-06-05T17:54:30.415-05:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=365 used=0 remaining=365
[GIN] 2025/06/05 - 17:54:30 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:30 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/06/05 - 17:54:30 | 200 | 4.8496251s | 10.0.0.25 | POST "/api/chat"
time=2025-06-05T17:54:30.945-05:00 level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-06-05T17:54:30.945-05:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 duration=2562047h47m16.854775807s
time=2025-06-05T17:54:30.945-05:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 refCount=0
[GIN] 2025/06/05 - 17:54:33 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:33 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/06/05 - 17:54:35 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:35 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-06-05T17:54:36.034-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:151 msg=reloading runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:287 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225 refCount=0
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:300 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=20225
time=2025-06-05T17:54:36.035-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="67.9 GiB" before.free_swap="52.6 GiB" now.total="95.7 GiB" now.free="66.8 GiB" now.free_swap="27.1 GiB"
time=2025-06-05T17:54:36.072-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="15.9 GiB" now.used="14.7 GiB"
time=2025-06-05T17:54:36.072-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="21.2 GiB" now.total="24.0 GiB" now.free="17179869183.8 GiB" now.used="23.2 GiB"
releasing nvml library
time=2025-06-05T17:54:36.073-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=42340
time=2025-06-05T17:54:36.074-05:00 level=DEBUG source=server.go:1029 msg="waiting for llama server to exit" pid=42340
time=2025-06-05T17:54:36.283-05:00 level=DEBUG source=server.go:1033 msg="llama server stopped" pid=42340
time=2025-06-05T17:54:36.283-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.325-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="66.8 GiB" before.free_swap="27.1 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.359-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="15.9 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.359-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="17179869183.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.360-05:00 level=DEBUG source=sched.go:700 msg="gpu VRAM free memory converged after 0.32 seconds" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.360-05:00 level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.360-05:00 level=DEBUG source=sched.go:306 msg="unload completed" runner.size="41.7 GiB" runner.vram="41.7 GiB" runner.parallel=1 runner.pid=42340 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.360-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.390-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.390-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.400-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:36.411-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:36.411-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:36.411-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:54:36.412-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:36.438-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:36.439-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:36.439-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:54:36.440-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:36.470-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:36.470-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:36.471-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]" time=2025-06-05T17:54:36.471-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:36.501-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:36.501-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:36.501-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" 
model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 library=cuda parallel=1 required="50.1 GiB" time=2025-06-05T17:54:36.501-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:36.535-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:36.535-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:36.535-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB" time=2025-06-05T17:54:36.535-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]" time=2025-06-05T17:54:36.535-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:36.564-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:36.564-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce 
RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:36.565-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="50.1 GiB" memory.required.partial="50.1 GiB" memory.required.kv="6.6 GiB" memory.required.allocations="[30.1 GiB 20.0 GiB]" memory.weights.total="22.8 GiB" memory.weights.repeating="22.2 GiB" memory.weights.nonrepeating="680.0 MiB" memory.graph.full="4.4 GiB" memory.graph.partial="4.4 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB" time=2025-06-05T17:54:36.565-05:00 level=INFO source=server.go:211 msg="enabling flash attention" time=2025-06-05T17:54:36.565-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type="" time=2025-06-05T17:54:36.565-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]" time=2025-06-05T17:54:36.596-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1 time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" 
key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06 time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540 time=2025-06-05T17:54:36.599-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06 time=2025-06-05T17:54:36.603-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 time=2025-06-05T17:54:36.603-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12] time=2025-06-05T17:54:36.603-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 --ctx-size 43225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52894" time=2025-06-05T17:54:36.603-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL 
Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 time=2025-06-05T17:54:36.605-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1 time=2025-06-05T17:54:36.606-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding" time=2025-06-05T17:54:36.606-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error" time=2025-06-05T17:54:36.632-05:00 level=INFO source=runner.go:925 msg="starting ollama engine" time=2025-06-05T17:54:36.632-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52894" time=2025-06-05T17:54:36.651-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:54:36.652-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default="" time=2025-06-05T17:54:36.652-05:00 level=DEBUG source=ggml.go:155 msg="key not found" 
key=general.description default="" time=2025-06-05T17:54:36.652-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q8_0 name="" description="" num_tensors=585 num_key_values=43 time=2025-06-05T17:54:36.652-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll time=2025-06-05T17:54:36.660-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll time=2025-06-05T17:54:36.788-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2025-06-05T17:54:36.857-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" time=2025-06-05T17:54:36.884-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="12.0 GiB" time=2025-06-05T17:54:36.884-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="680.0 MiB" time=2025-06-05T17:54:36.884-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="11.6 GiB" 
time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1 time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06 time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540 time=2025-06-05T17:54:36.884-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06 time=2025-06-05T17:54:37.076-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1 time=2025-06-05T17:54:37.076-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B" time=2025-06-05T17:54:37.076-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB" time=2025-06-05T17:54:37.076-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B" [GIN] 2025/06/05 - 17:54:37 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:37 | 200 | 0s | 127.0.0.1 | GET "/api/ps" ggml_backend_cuda_buffer_type_alloc_buffer: allocating 84.50 MiB on device 1: cudaMalloc failed: out of memory panic: insufficient memory - required allocations: {InputWeights:713031680A CPU:{Name:CPU Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 
0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} GPUs:[{Name:CUDA0 Weights:[595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} {Name:CUDA1 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 595435520A 1591070720A] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344F 0U 0U 0U 0U 0U 0U 0U] Graph:9791055360A}]} goroutine 66 [running]: github.com/ollama/ollama/ml/backend/ggml.(*Context).newTensor(0xc0002ea380, 0x1881be03720?, {0xc0012302b8, 0x3, 0x7ff7a21bbe02?}) C:/a/ollama/ollama/ml/backend/ggml/ggml.go:714 +0x696 github.com/ollama/ollama/ml/backend/ggml.(*Context).Zeros(0x7ff7a32c6ec0?, 0xc001242ea0?, {0xc0012302b8?, 0xc00005f908?, 0x7ff7a2205009?}) C:/a/ollama/ollama/ml/backend/ggml/ggml.go:727 +0x1c github.com/ollama/ollama/kvcache.(*Causal).Put(0xc00122aa50, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48, 0xc0010bb1e8}, {0x7ff7a35d6a48, 0xc0010bb218}) C:/a/ollama/ollama/kvcache/causal.go:566 +0x4cc 
github.com/ollama/ollama/ml/nn.Attention({0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48, 0xc0010bb170}, {0x7ff7a35d6a48, 0xc0010bb1e8}, {0x7ff7a35d6a48, 0xc0010bb218}, 0x3fb6a09e667f3bcc, {0x7ff7a35c9a00, ...}) C:/a/ollama/ollama/ml/nn/attention.go:39 +0x1c3 github.com/ollama/ollama/model/models/mistral3.(*SelfAttention).Forward(0xc00034ff40, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48, 0xc0010bb110}, {0x7ff7a35d6a48, 0xc000009dd0}, {0x7ff7a35c9a00, 0xc00122aa50}, 0xc0010aef00) C:/a/ollama/ollama/model/models/mistral3/model_text.go:52 +0x3f3 github.com/ollama/ollama/model/models/mistral3.(*Layer).Forward(0xc00004b8e0, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48, 0xc0010bb0f8}, {0x7ff7a35d6a48, 0xc000009dd0}, {0x0, 0x0}, {0x7ff7a35c9a00, ...}, ...) C:/a/ollama/ollama/model/models/mistral3/model_text.go:83 +0xd9 github.com/ollama/ollama/model/models/mistral3.(*TextModel).Forward(0xc00127f740, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {0x7ff7a35d6a48?, 0xc000009da0?}, {0x7ff7a35d6a48, 0xc000009dd0}, {0x7ff7a35d6a48, 0xc000009de8}, {{0x7ff7a35d6a48, ...}, ...}, ...) C:/a/ollama/ollama/model/models/mistral3/model_text.go:117 +0x3aa github.com/ollama/ollama/model/models/mistral3.(*Model).Forward(0xc000454f50, {0x7ff7a35ca8c8, 0xc0011a3cc0}, {{0x7ff7a35d6a48, 0xc000009da0}, {0xc000166600, 0x9, 0x10}, {0xc000e7a800, 0x200, ...}, ...}) C:/a/ollama/ollama/model/models/mistral3/model.go:164 +0x1d7 github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc00054e000) C:/a/ollama/ollama/runner/ollamarunner/runner.go:821 +0xac5 github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc00054e000, {0xc0000de000?, 0x0?}, {0x8, 0x0, 0x29, {0xc0003c6268, 0x2, 0x2}, 0x1}, ...) C:/a/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270 github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc00054e000, {0x7ff7a35c2540, 0xc000550000}, {0xc0000de000?, 0x0?}, {0x8, 0x0, 0x29, {0xc0003c6268, 0x2, ...}, ...}, ...) 
C:/a/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8 created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1 C:/a/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11 time=2025-06-05T17:54:37.641-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error" time=2025-06-05T17:54:37.669-05:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2" time=2025-06-05T17:54:37.891-05:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory" time=2025-06-05T17:54:37.891-05:00 level=DEBUG source=sched.go:491 msg="triggering expiration for failed load" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=43225 time=2025-06-05T17:54:37.891-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=43225 time=2025-06-05T17:54:37.891-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=43225 time=2025-06-05T17:54:37.891-05:00 level=DEBUG 
source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q8_0 runner.inference=cuda runner.devices=2 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 runner.num_ctx=43225 time=2025-06-05T17:54:37.891-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" [GIN] 2025/06/05 - 17:54:37 | 500 | 1.8841691s | 10.0.0.25 | POST "/api/chat" time=2025-06-05T17:54:37.917-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:37.917-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:37.917-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=24780 time=2025-06-05T17:54:37.917-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0 time=2025-06-05T17:54:38.168-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" 
now.free_swap="65.8 GiB" time=2025-06-05T17:54:38.203-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:38.203-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:38.418-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.8 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:38.485-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:38.485-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:38.669-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:38.705-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" 
now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:38.705-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:38.918-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:38.955-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:38.955-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:39.168-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:39.188-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:39.188-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" 
before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library [GIN] 2025/06/05 - 17:54:39 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:54:39 | 200 | 0s | 127.0.0.1 | GET "/api/ps" time=2025-06-05T17:54:39.418-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:39.452-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:39.452-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:39.668-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:39.703-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:39.703-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library 
time=2025-06-05T17:54:39.918-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:39.953-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:39.953-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:40.168-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:54:40.203-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:54:40.203-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:54:40.418-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" 
time=2025-06-05T17:54:40.468-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:40.468-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:40.668-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:40.704-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:40.704-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:40.918-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:40.951-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:40.951-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:41.168-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:41.205-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:41.205-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:41.419-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:41.453-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:41.453-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
[GIN] 2025/06/05 - 17:54:41 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:41 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-06-05T17:54:41.668-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:41.703-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:41.703-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:41.918-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:41.953-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:41.953-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:42.168-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:42.202-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:42.202-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:42.418-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:42.457-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:42.457-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:42.668-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:42.701-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:42.701-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:42.918-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0263834 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:42.918-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:42.921-05:00 level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:42.921-05:00 level=DEBUG source=sched.go:312 msg="ignoring unload event with no pending requests"
time=2025-06-05T17:54:42.953-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:42.953-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.168-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2763602 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:43.168-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.202-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.202-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.418-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5263857 runner.size="50.1 GiB" runner.vram="50.1 GiB" runner.parallel=1 runner.pid=24780 runner.model=o:\ollama\models\blobs\sha256-de0f4b9634e4bb82a84dd0a376c4a6787dbf4ce5b52a62e39be103bc9c8245d0
time=2025-06-05T17:54:43.521-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.523-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
[GIN] 2025/06/05 - 17:54:43 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:43 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-06-05T17:54:43.622-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.632-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.633-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:54:43.633-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]"
time=2025-06-05T17:54:43.633-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.673-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.673-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.674-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2025-06-05T17:54:43.674-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.705-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.705-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.706-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:43.707-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.737-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.737-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.737-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc library=cuda parallel=1 required="39.9 GiB"
time=2025-06-05T17:54:43.737-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.764-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.764-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.766-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB"
time=2025-06-05T17:54:43.766-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]"
time=2025-06-05T17:54:43.766-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB"
time=2025-06-05T17:54:43.795-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB"
time=2025-06-05T17:54:43.795-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB"
releasing nvml library
time=2025-06-05T17:54:43.796-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.9 GiB" memory.required.partial="39.9 GiB" memory.required.kv="6.6 GiB" memory.required.allocations="[24.9 GiB 15.0 GiB]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="4.4 GiB" memory.graph.partial="4.4 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-06-05T17:54:43.796-05:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-05T17:54:43.796-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-06-05T17:54:43.796-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-06-05T17:54:43.819-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:43.820-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:43.824-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-06-05T17:54:43.824-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
time=2025-06-05T17:54:43.824-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc --ctx-size 43225 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52903"
time=2025-06-05T17:54:43.824-05:00 level=DEBUG source=server.go:432 msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5
time=2025-06-05T17:54:43.827-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:54:43.827-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:54:43.828-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-06-05T17:54:43.852-05:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T17:54:43.853-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52903"
time=2025-06-05T17:54:43.871-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:54:43.872-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default=""
time=2025-06-05T17:54:43.872-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default=""
time=2025-06-05T17:54:43.873-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=43
time=2025-06-05T17:54:43.873-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama
load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-06-05T17:54:43.881-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-06-05T17:54:44.004-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-06-05T17:54:44.080-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T17:54:44.093-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="7.2 GiB"
time=2025-06-05T17:54:44.095-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-06-05T17:54:44.095-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="6.7 GiB"
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540
time=2025-06-05T17:54:44.095-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06
time=2025-06-05T17:54:44.284-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1
time=2025-06-05T17:54:44.284-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-06-05T17:54:44.284-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:44.284-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T17:54:44.604-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1265 splits=4
time=2025-06-05T17:54:44.604-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="421.7 MiB"
time=2025-06-05T17:54:44.604-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-06-05T17:54:44.604-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 MiB"
time=2025-06-05T17:54:44.605-05:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=550502400A allocated.CPU.Graph=10485760A allocated.CUDA0.Weights="[363438080A 363438080A 363438080A 363438080A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Cache="[177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Graph=442163200A allocated.CUDA1.Weights="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 363438080A 363438080A 363438080A 363438080A 363438080A 1255526400A]" allocated.CUDA1.Cache="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 177209344A 0U]" allocated.CUDA1.Graph=9791055360A
time=2025-06-05T17:54:44.832-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.15"
time=2025-06-05T17:54:45.084-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.25"
time=2025-06-05T17:54:45.335-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.34"
time=2025-06-05T17:54:45.586-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.44"
[GIN] 2025/06/05 - 17:54:45 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:45 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-06-05T17:54:45.836-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.56"
time=2025-06-05T17:54:46.086-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.75"
time=2025-06-05T17:54:46.337-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.90"
time=2025-06-05T17:54:46.587-05:00 level=INFO source=server.go:630 msg="llama runner started in 2.76 seconds"
time=2025-06-05T17:54:46.587-05:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:54:46.611-05:00 level=DEBUG source=server.go:729 msg="completion request" images=0 prompt=1600 format=""
time=2025-06-05T17:54:46.624-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[1]
time=2025-06-05T17:54:46.624-05:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=365 used=0 remaining=365
[GIN] 2025/06/05 - 17:54:47 | 200 | 3.6897465s | 10.0.0.25 | POST "/api/chat"
time=2025-06-05T17:54:47.147-05:00 level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-06-05T17:54:47.147-05:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225 duration=2562047h47m16.854775807s
time=2025-06-05T17:54:47.147-05:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225 refCount=0
[GIN] 2025/06/05 - 17:54:47 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:47 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/06/05 - 17:54:50 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:50 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/06/05 - 17:54:52 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:52 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/06/05 - 17:54:54 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:54 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/06/05 - 17:54:56 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:56 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/06/05 - 17:54:59 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:54:59 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/06/05 - 17:55:01 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/06/05 - 17:55:01 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-06-05T17:55:02.357-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:151 msg=reloading runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:287 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225 refCount=0
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:300 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:364 msg="runner expired event received" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:379 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=sched.go:402 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=43225
time=2025-06-05T17:55:02.358-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="67.0 GiB" now.free_swap="33.4 GiB"
time=2025-06-05T17:55:02.388-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="18.9 GiB" now.used="11.7 GiB"
time=2025-06-05T17:55:02.388-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="2.9 GiB" now.used="20.1 GiB"
releasing nvml library
time=2025-06-05T17:55:02.388-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=9872
time=2025-06-05T17:55:02.389-05:00 level=DEBUG source=server.go:1029 msg="waiting for llama server to exit" pid=9872 time=2025-06-05T17:55:02.585-05:00 level=DEBUG source=server.go:1033 msg="llama server stopped" pid=9872 time=2025-06-05T17:55:02.585-05:00 level=DEBUG source=sched.go:407 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc time=2025-06-05T17:55:02.639-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="67.0 GiB" before.free_swap="33.4 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:55:02.676-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="18.9 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:02.676-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="2.9 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:02.888-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:55:02.921-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:02.921-05:00 
level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:03.138-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:03.172-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:03.172-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:03.389-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:55:03.419-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:03.419-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 
MiB" releasing nvml library [GIN] 2025/06/05 - 17:55:03 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:55:03 | 200 | 0s | 127.0.0.1 | GET "/api/ps" time=2025-06-05T17:55:03.638-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:03.672-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:03.672-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:03.889-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:55:03.921-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:03.921-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:04.138-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" 
before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:04.164-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:04.164-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:04.388-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:04.421-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:04.421-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:04.639-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:04.666-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" 
gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:04.666-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:04.889-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:04.925-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:04.925-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:05.139-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:05.170-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:05.171-05:00 level=DEBUG 
source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:05.389-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:05.422-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:05.422-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:05.640-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:05.674-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:05.674-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" 
releasing nvml library [GIN] 2025/06/05 - 17:55:05 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:55:05 | 200 | 0s | 127.0.0.1 | GET "/api/ps" time=2025-06-05T17:55:05.889-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:05.921-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:05.921-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:06.139-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:55:06.171-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:06.171-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:06.388-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" 
before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:06.419-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:06.419-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:06.638-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:06.672-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:06.672-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:06.889-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:55:06.922-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" 
gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:06.922-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:07.138-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:07.176-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:07.176-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:07.388-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0305489 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc time=2025-06-05T17:55:07.388-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:07.391-05:00 
level=DEBUG source=sched.go:410 msg="sending an unloaded event" runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc time=2025-06-05T17:55:07.391-05:00 level=DEBUG source=sched.go:306 msg="unload completed" runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc time=2025-06-05T17:55:07.422-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:07.422-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:07.423-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:07.454-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:07.454-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 
MiB" releasing nvml library time=2025-06-05T17:55:07.464-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:55:07.483-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:55:07.485-05:00 level=DEBUG source=sched.go:228 msg="loading first model" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc time=2025-06-05T17:55:07.485-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[30.1 GiB]" time=2025-06-05T17:55:07.485-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:07.516-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:07.516-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:07.517-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]" time=2025-06-05T17:55:07.517-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:07.547-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 
5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:07.547-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:07.548-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]" time=2025-06-05T17:55:07.548-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="66.0 GiB" time=2025-06-05T17:55:07.578-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:07.578-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:07.578-05:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc library=cuda parallel=1 required="50.4 GiB" time=2025-06-05T17:55:07.578-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="66.0 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" 
time=2025-06-05T17:55:07.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:07.610-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:07.611-05:00 level=INFO source=server.go:135 msg="system memory" total="95.7 GiB" free="68.2 GiB" free_swap="65.9 GiB" time=2025-06-05T17:55:07.611-05:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=2 available="[30.1 GiB 22.8 GiB]" time=2025-06-05T17:55:07.611-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:55:07.638-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2805774 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc time=2025-06-05T17:55:07.645-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:07.645-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" 
before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:07.646-05:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[30.1 GiB 22.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="50.4 GiB" memory.required.partial="50.4 GiB" memory.required.kv="11.0 GiB" memory.required.allocations="[30.1 GiB 20.3 GiB]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="7.3 GiB" memory.graph.partial="7.3 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB" time=2025-06-05T17:55:07.646-05:00 level=INFO source=server.go:211 msg="enabling flash attention" time=2025-06-05T17:55:07.646-05:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type="" time=2025-06-05T17:55:07.646-05:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]" time=2025-06-05T17:55:07.646-05:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="95.7 GiB" before.free="68.2 GiB" before.free_swap="65.9 GiB" now.total="95.7 GiB" now.free="68.2 GiB" now.free_swap="65.9 GiB" time=2025-06-05T17:55:07.669-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-06-05T17:55:07.671-05:00 level=DEBUG 
source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1 time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06 time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.longest_edge default=1540 time=2025-06-05T17:55:07.671-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06 time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda name="NVIDIA GeForce RTX 5090" overhead="1.2 GiB" before.total="31.8 GiB" before.free="30.1 GiB" now.total="31.8 GiB" now.free="30.1 GiB" now.used="506.0 MiB" time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 name="NVIDIA GeForce RTX 3090" overhead="1020.0 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="250.0 MiB" releasing nvml library time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=server.go:360 msg="adding gpu library" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=server.go:367 msg="adding gpu dependency paths" paths=[C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12] time=2025-06-05T17:55:07.675-05:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model o:\\ollama\\models\\blobs\\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc --ctx-size 72112 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 21,20 --port 52916" time=2025-06-05T17:55:07.675-05:00 level=DEBUG source=server.go:432 
msg=subprocess OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=o:\ollama\models OLLAMA_NUM_PARALLEL=1 OLLAMA_ORIGINS=* PATH="C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v12;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Program Files\\Volta\\;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\Amazon\\AWSCLIV2\\;F:\\software\\terraform;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\wardm\\AppData\\Local\\Volta\\bin;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\Scripts\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Python313\\;C:\\Users\\wardm\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\wardm\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama;C:\\Users\\wardm\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\wardm\\.dotnet\\tools;C:\\Users\\wardm\\AppData\\Local\\Programs\\Ollama\\lib\\ollama" OLLAMA_LIBRARY_PATH=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-32fda0b3-4602-83bb-0be7-24ef41847cda,GPU-94cbac27-f369-78cc-aa94-c19e5affc2a5 time=2025-06-05T17:55:07.678-05:00 level=INFO source=sched.go:483 msg="loaded runners" count=1 
time=2025-06-05T17:55:07.678-05:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding" time=2025-06-05T17:55:07.678-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error" time=2025-06-05T17:55:07.704-05:00 level=INFO source=runner.go:925 msg="starting ollama engine" time=2025-06-05T17:55:07.704-05:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:52916" time=2025-06-05T17:55:07.722-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-06-05T17:55:07.724-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.name default="" time=2025-06-05T17:55:07.724-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.description default="" time=2025-06-05T17:55:07.724-05:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=43 time=2025-06-05T17:55:07.724-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama load_backend: loaded CPU backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll time=2025-06-05T17:55:07.730-05:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes load_backend: loaded CUDA backend from C:\Users\wardm\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll time=2025-06-05T17:55:07.876-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 
CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2025-06-05T17:55:07.889-05:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5307762 runner.size="39.9 GiB" runner.vram="39.9 GiB" runner.parallel=1 runner.pid=9872 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc time=2025-06-05T17:55:07.932-05:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" time=2025-06-05T17:55:07.964-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="7.2 GiB" time=2025-06-05T17:55:07.964-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="525.0 MiB" time=2025-06-05T17:55:07.964-05:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="6.7 GiB" time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.rope.freq_scale default=1 time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06 time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" 
key=mistral3.vision.longest_edge default=1540 time=2025-06-05T17:55:07.965-05:00 level=DEBUG source=ggml.go:155 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06 [GIN] 2025/06/05 - 17:55:08 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:55:08 | 200 | 0s | 127.0.0.1 | GET "/api/ps" time=2025-06-05T17:55:08.153-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1175 splits=1 time=2025-06-05T17:55:08.153-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B" time=2025-06-05T17:55:08.153-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB" time=2025-06-05T17:55:08.153-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B" time=2025-06-05T17:55:08.598-05:00 level=DEBUG source=ggml.go:620 msg="compute graph" nodes=1265 splits=4 time=2025-06-05T17:55:08.598-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="534.7 MiB" time=2025-06-05T17:55:08.598-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB" time=2025-06-05T17:55:08.598-05:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 MiB" time=2025-06-05T17:55:08.599-05:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=550502400A allocated.CPU.Graph=10485760A allocated.CUDA0.Weights="[363438080A 363438080A 363438080A 363438080A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Cache="[295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 
295698432A 295698432A 295698432A 295698432A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U]" allocated.CUDA0.Graph=560652288A allocated.CUDA1.Weights="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 320184320A 320184320A 363438080A 363438080A 363438080A 363438080A 363438080A 363438080A 1255526400A]" allocated.CUDA1.Cache="[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 295698432A 0U]" allocated.CUDA1.Graph=9791055360A time=2025-06-05T17:55:08.684-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.06" time=2025-06-05T17:55:08.934-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.20" time=2025-06-05T17:55:09.185-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.30" time=2025-06-05T17:55:09.436-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.39" time=2025-06-05T17:55:09.687-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.48" time=2025-06-05T17:55:09.937-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.66" [GIN] 2025/06/05 - 17:55:10 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:55:10 | 200 | 0s | 127.0.0.1 | GET "/api/ps" time=2025-06-05T17:55:10.188-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.83" time=2025-06-05T17:55:10.438-05:00 level=DEBUG source=server.go:636 msg="model load progress 0.98" time=2025-06-05T17:55:10.689-05:00 level=INFO source=server.go:630 msg="llama runner started in 3.01 seconds" time=2025-06-05T17:55:10.689-05:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M 
runner.inference=cuda runner.devices=2 runner.size="50.4 GiB" runner.vram="50.4 GiB" runner.parallel=1 runner.pid=38020 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=72112 time=2025-06-05T17:55:10.711-05:00 level=DEBUG source=server.go:729 msg="completion request" images=0 prompt=1600 format="" time=2025-06-05T17:55:10.726-05:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[1] time=2025-06-05T17:55:10.727-05:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=365 used=0 remaining=365 [GIN] 2025/06/05 - 17:55:11 | 200 | 8.8926139s | 10.0.0.25 | POST "/api/chat" time=2025-06-05T17:55:11.238-05:00 level=DEBUG source=sched.go:503 msg="context for request finished" time=2025-06-05T17:55:11.238-05:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="50.4 GiB" runner.vram="50.4 GiB" runner.parallel=1 runner.pid=38020 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=72112 duration=2562047h47m16.854775807s time=2025-06-05T17:55:11.238-05:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/mistral-small3.1:24b-instruct-2503-q4_K_M runner.inference=cuda runner.devices=2 runner.size="50.4 GiB" runner.vram="50.4 GiB" runner.parallel=1 runner.pid=38020 runner.model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc runner.num_ctx=72112 refCount=0 [GIN] 2025/06/05 - 17:55:12 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:55:12 | 200 | 512.5µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:55:14 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:55:14 | 200 | 0s | 127.0.0.1 | 
GET "/api/ps" [GIN] 2025/06/05 - 17:55:16 | 200 | 511.8µs | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:55:16 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/06/05 - 17:55:18 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/06/05 - 17:55:18 | 200 | 0s | 127.0.0.1 | GET "/api/ps" time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=sched.go:322 msg="shutting down scheduler completed loop" time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=sched.go:122 msg="shutting down scheduler pending loop" time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=sched.go:872 msg="shutting down runner" model=o:\ollama\models\blobs\sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=38020 time=2025-06-05T17:55:19.049-05:00 level=DEBUG source=server.go:1029 msg="waiting for llama server to exit" pid=38020 time=2025-06-05T17:55:19.266-05:00 level=DEBUG source=server.go:1033 msg="llama server stopped" pid=38020 ```
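For reference, the `memory.required.kv="11.0 GiB"` figure in the log above is consistent with an f16 K/V cache at `--ctx-size 72112`. A minimal back-of-the-envelope sketch, assuming Mistral Small 3.1's published text-model hyperparameters (40 layers, 8 KV heads, head dimension 128 — values from the model config, not from this log) and 2-byte f16 cache entries:

```python
# Rough KV-cache size estimate for the run in the log above.
# Assumed hyperparameters (Mistral Small 3.1 config, not read from the log):
# 40 text layers, 8 KV heads, head dim 128; f16 cache (2 bytes per element).
n_layers = 40
n_kv_heads = 8
head_dim = 128
ctx = 72112          # --ctx-size from the runner command line
bytes_per_elem = 2   # f16

# K and V each store ctx * n_kv_heads * head_dim elements per layer.
kv_bytes = 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB")  # → 11.0 GiB
```

This matches the log's `memory.required.kv`, so the extra footprint of the `mistral3` build relative to the older `llama`-architecture upload comes mainly from the vision projector (`projector.weights="769.3 MiB"`, `projector.graph="8.8 GiB"`) rather than from the text-side KV cache.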
@jessegross commented on GitHub (Jun 16, 2025):

There is an early preview of Ollama's new memory management with the goal of comprehensively fixing these issues. It is still in development; however, if you want to compile from source and try it out, you can find it here: https://github.com/ollama/ollama/pull/11090

Please leave any feedback on that PR.

<!-- gh-comment-id:2978263710 -->
@maglat commented on GitHub (Jun 21, 2025):

The memory issue still persists with Mistral-Small-3.2.

<!-- gh-comment-id:2993492582 -->
@MarkWard0110 commented on GitHub (Jun 25, 2025):

> There is an early preview of Ollama's new memory management with the goal of comprehensively fixing these issues. It is still in development, however, if you want to compile from source and try it out, you can find it here: #11090
>
> Please leave any feedback on that PR.

I tried the branch and got the same issue.

<!-- gh-comment-id:3006548875 -->
@jessegross commented on GitHub (Sep 24, 2025):

I'm going to go ahead and close this now that the new memory management logic is on by default. If you continue to see problems, please file a new issue.

<!-- gh-comment-id:3330108278 -->
Reference: github-starred/ollama#53457