[GH-ISSUE #6416] Computer crashes after switching several Ollama models in a relatively short amount of time #50544

Closed
opened 2026-04-28 16:18:46 -05:00 by GiteaMirror · 7 comments

Originally created by @elsatch on GitHub (Aug 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6416

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I love running tests to compare outputs from different models. To do so, I've used tools like promptfoo or langfuse (on top of Haystack or Langchain). In these tools, you set a list of models and the program calls Ollama to load them one after the other. I am using a Linux computer with Ubuntu 22.04 and an RTX 3090.

After the program loads some of the big models (fp16, command-r or gemma2:27b), my computer freezes while it is loading the next model. I lose all access to the machine over ssh and it no longer responds. To recover, I have to hard-reset it by pressing the physical power button.

This is a sample run where it crashes loading mistral-nemo:12b (though it has happened with other models too, so this might not be model-specific):

Cache is disabled.
Providers are running in serial with user input.
Running 1 evaluations for provider ollama:chat:command-r:latest with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:command-r:latest "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:gemma2:27b-instruct-q5_K_M with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:gemma2:27b-instruct-q5_K_M "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:gemma2:2b-instruct-q8_0 with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:gemma2:2b-instruct-q8_0 "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:gemma2:9b with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:gemma2:9b "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:llama3:8b-instruct-fp16 with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:llama3:8b-instruct-fp16 "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:mayflowergmbh/occiglot-7b-eu5-instruct:latest with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:mayflowergmbh/occiglot-7b-eu5-instruct:latest "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:mistral-nemo:12b-instruct-2407-q8_0 with concurrency=4...
[░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0% | ETA: 0s | 0/1 | ""

Mistral-Nemo never loads successfully and the computer becomes unresponsive. The local console is also frozen (I have Psensor displaying real-time metrics and... they look quite normal, and frozen too).

Using other backends besides Ollama has not resulted in such crashes, but they are less convenient to use, so I'd love to debug this and find out what's making the OS crash.
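For context, the pattern these eval tools follow is just a sequential loop of chat calls that forces Ollama to swap models. A minimal sketch of that loop (the model list, prompt, and timeout below are illustrative, not taken from the promptfoo config; the `/api/chat` endpoint and payload shape follow the Ollama REST API):

```python
# Minimal sketch of the model-switching pattern that precedes the freeze.
import requests

models = [
    "command-r:latest",
    "gemma2:27b-instruct-q5_K_M",
    "llama3:8b-instruct-fp16",
    "mistral-nemo:12b-instruct-2407-q8_0",
]

for model in models:
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": "Say hello."}],
            "stream": False,
        },
        timeout=600,  # large models can take a while to load
    )
    resp.raise_for_status()
    print(model, "->", resp.json()["message"]["content"][:60])
```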

Checking the logs, I see no message related to the crash or to the loading of the new model:

NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.90.07  Fri May 31 09:35:42 UTC 2024
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model ftype      = Q4_0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model params     = 7.24 B
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: general.name     = mayflowergmbh
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: BOS token        = 1 '<s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOS token        = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: UNK token        = 0 '<unk>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: PAD token        = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: LF token         = 13 '<0x0A>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOT token        = 32001 '<|im_end|>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: max token length = 48
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: found 1 CUDA devices:
ago 19 10:47:19 ananke ollama[2292]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
ago 19 10:47:20 ananke ollama[2292]: llm_load_tensors: ggml ctx size =    0.27 MiB
ago 19 10:47:20 ananke ollama[2292]: time=2024-08-19T10:47:20.104+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading 32 repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading non-repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloaded 33/33 layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors:        CPU buffer size =    70.32 MiB
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors:      CUDA0 buffer size =  3847.56 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ctx      = 65536
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_batch    = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ubatch   = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: flash_attn = 0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_base  = 10000.0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_scale = 1
ago 19 10:47:23 ananke ollama[2292]: llama_kv_cache_init:      CUDA0 KV buffer size =  8192.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: KV self size  = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.55 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:      CUDA0 compute buffer size =  4256.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:  CUDA_Host compute buffer size =   136.01 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph nodes  = 1030
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph splits = 2
ago 19 10:47:23 ananke ollama[8085]: INFO [main] model loaded | tid="130282049744896" timestamp=1724057243
ago 19 10:47:23 ananke ollama[2292]: time=2024-08-19T10:47:23.622+02:00 level=INFO source=server.go:632 msg="llama runner started in 3.77 seconds"
ago 19 10:47:26 ananke ollama[2292]: [GIN] 2024/08/19 - 10:47:26 | 200 |  6.932276921s |       127.0.0.1 | POST     "/api/chat"
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: f_logit_scale    = 0.0e+00
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_ff             = 14336
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_expert         = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_expert_used    = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: causal attn      = 1
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: pooling type     = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: rope type        = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: rope scaling     = linear
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: freq_base_train  = 10000.0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: freq_scale_train = 1
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_ctx_orig_yarn  = 32768
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: rope_finetuned   = unknown
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_d_conv       = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_d_inner      = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_d_state      = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_dt_rank      = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model type       = 7B
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model ftype      = Q4_0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model params     = 7.24 B
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: general.name     = mayflowergmbh
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: BOS token        = 1 '<s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOS token        = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: UNK token        = 0 '<unk>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: PAD token        = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: LF token         = 13 '<0x0A>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOT token        = 32001 '<|im_end|>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: max token length = 48
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: found 1 CUDA devices:
ago 19 10:47:19 ananke ollama[2292]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
ago 19 10:47:20 ananke ollama[2292]: llm_load_tensors: ggml ctx size =    0.27 MiB
ago 19 10:47:20 ananke ollama[2292]: time=2024-08-19T10:47:20.104+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading 32 repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading non-repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloaded 33/33 layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors:        CPU buffer size =    70.32 MiB
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors:      CUDA0 buffer size =  3847.56 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ctx      = 65536
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_batch    = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ubatch   = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: flash_attn = 0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_base  = 10000.0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_scale = 1
ago 19 10:47:23 ananke ollama[2292]: llama_kv_cache_init:      CUDA0 KV buffer size =  8192.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: KV self size  = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.55 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:      CUDA0 compute buffer size =  4256.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:  CUDA_Host compute buffer size =   136.01 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph nodes  = 1030
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph splits = 2
ago 19 10:47:23 ananke ollama[8085]: INFO [main] model loaded | tid="130282049744896" timestamp=1724057243
ago 19 10:47:23 ananke ollama[2292]: time=2024-08-19T10:47:23.622+02:00 level=INFO source=server.go:632 msg="llama runner started in 3.77 seconds"
ago 19 10:47:26 ananke ollama[2292]: [GIN] 2024/08/19 - 10:47:26 | 200 |  6.932276921s |       127.0.0.1 | POST     "/api/chat"
ago 19 10:47:27 ananke ollama[2292]: time=2024-08-19T10:47:27.959+02:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-16d65981-a438-ac56-8bab-f1393824041b library=cuda total="23.7 GiB" available="6.9 GiB"

Any help to troubleshoot the issue will be greatly appreciated!

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.6

GiteaMirror added the "needs more info", "bug", "nvidia", "linux" labels 2026-04-28 16:18:48 -05:00

@MaxJa4 commented on GitHub (Aug 19, 2024):

Freezing of the entire system sounds a lot like RAM overflowing. Since you use large models such as Command R and Mistral Nemo, they will spill over from GPU to CPU (RAM) with your GPU model (which is fine). But since you also seem to use huge context sizes (64k), you may run out of RAM, or get too close to it. Have you monitored RAM usage on the affected system? Do you have swap enabled and set to a reasonable size (e.g. 8 GB) so memory can spill over instead of crashing your system?

I frequently use Open WebUI with 3-10 models at a time (sequentially) to compare them, which causes lots of reloads; I've never had any issues there.
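One quick way to test this hypothesis is to log RAM and swap continuously while the eval runs. A minimal sketch (assuming psutil is installed; the log file name is arbitrary):

```python
# Logs RAM and swap usage once per second; the last line written before a
# freeze shows how close the system was to exhausting memory.
import time

import psutil

with open("mem_watch.log", "a", buffering=1) as f:  # line-buffered
    while True:
        vm = psutil.virtual_memory()
        sw = psutil.swap_memory()
        f.write(f"{time.strftime('%H:%M:%S')} ram={vm.percent}% swap={sw.percent}%\n")
        time.sleep(1)
```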


@rick-github commented on GitHub (Aug 19, 2024):

The log appears incomplete; there should be lines like `time=2024-08-19T10:30:24.515Z level=INFO source=memory.go:309 msg="offload to cuda"` which would aid in investigating what's happening. Can you post the full log?

Based on `llm_load_tensors: offloaded 33/33 layers to GPU`, it would appear that mayflowergmbh was fully loaded onto the GPU. mistral-nemo:12b-instruct-2407-q8_0 is larger at 41 layers, which may or may not fit in the 24 GB of an RTX 3090 if other models are resident. Without the full log, it's hard to say.
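As a rough pre-flight check, each model's reported size and quantization can be queried from Ollama before a run. A hedged sketch using the `/api/show` endpoint (the model names are the ones from this thread):

```python
# Queries Ollama's /api/show endpoint for each model's reported parameter
# size and quantization level, as a rough VRAM-fit sanity check.
import requests

for name in [
    "mayflowergmbh/occiglot-7b-eu5-instruct:latest",
    "mistral-nemo:12b-instruct-2407-q8_0",
]:
    info = requests.post(
        "http://localhost:11434/api/show", json={"name": name}, timeout=30
    ).json()
    details = info.get("details", {})
    print(name, details.get("parameter_size"), details.get("quantization_level"))
```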


@elsatch commented on GitHub (Aug 19, 2024):

Thanks @MaxJa4 for the idea of monitoring the RAM. I always choose models that fit into the VRAM of the card, but given the context sizes, some of the load might be offloading to RAM. I have 64 GB of RAM in this machine and the texts I am using are not too big, but with several models... I'll check this too.

Regarding the full log @rick-github, do you mean the full Ollama service log? I've attached the one corresponding to the previous run.
ollama_error_log.txt

The good news is that I think it was related to some kind of driver mismatch. I have just updated all packages on my machine (using the Lambda Stack repos to ensure everything is consistent), removed all unused packages, and was able to run a 20+ model test without any crash. I still need to check the logs to pinpoint which combinations worked and which didn't.


@viba1 commented on GitHub (Aug 30, 2024):

Hi!
I wanted to check if you have any updates regarding this bug. Has it been resolved, or is it still ongoing?
Thank you!


@dhiltgen commented on GitHub (Sep 3, 2024):

@elsatch happy to hear the driver version cleanup appears to have solved it. I'll close this one for now, but if you are able to reproduce it, here are a few things that may help narrow it down:

  • Run ollama serve with OLLAMA_DEBUG=1 set so we get more verbose logging (see the sketch after this list)
  • Check your OS level logs for signs of driver hangs, etc.
  • Keep an eye on memory usage, and paging (if you have swap enabled)

If it does happen again, let us know and I'll re-open the issue.
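For the first point, a minimal sketch (not from this thread) of launching a verbose foreground server from Python and capturing its output to a file:

```python
# Launches `ollama serve` with OLLAMA_DEBUG=1 and captures its output.
# Stop the system service first (e.g. `systemctl stop ollama`) so the port is free.
import os
import subprocess

env = dict(os.environ, OLLAMA_DEBUG="1")
with open("ollama_debug.log", "wb") as log:
    subprocess.run(["ollama", "serve"], env=env, stdout=log, stderr=subprocess.STDOUT)
```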


@elsatch commented on GitHub (Oct 7, 2024):

Hi @dhiltgen,

It has taken a while but... now I am able to reproduce it! I have produced a script that simulates my RAG evaluations. It generates responses for ten chess-related questions, passing three chunks as context (the chunks are not retrieved in relation to the question; they mostly add around 300 tokens of context for answer generation). The script generates the ten answers using one model, then loads the next; answers are generated sequentially, one after the other.

Whenever I run this doom_ollama.py script, my system comes to an absolute halt. I have made the system crash loading different combinations of models, so it's not model-specific. This is my list of models:

llm_models": [
"hermes3:8b-llama3.1-q8_0",
"mistral-nemo:12b-instruct-2407-q8_0",
"swaroopsittula/dragon-llama3.1:latest",
"nemotron-mini:4b-instruct-q8_0",
"supernova:latest",
"phi3:14b-medium-128k-instruct-q5_K_M",
"qwen2.5:32b",
"mixtral:8x7b-instruct-v0.1-q3_K_M",
"gemma2:27b-instruct-q5_K_M",
"llama3.2:latest",
"mistral:7b-instruct-v0.3-q8_0",
"llama3.1:8b-instruct-q8_0",
],

I have used the ollama ps command and, apparently, only one model at a time is loaded on my GPU. Some models like Gemma consume part GPU VRAM, part CPU RAM. But as one model gets loaded, the previous model no longer appears in the ollama ps output.
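A small poller can record the same information as `ollama ps` over time, so the residency state at the moment of the freeze is preserved. A hedged sketch against the `/api/ps` endpoint behind that command:

```python
# Polls Ollama's /api/ps endpoint every few seconds and prints which models
# are resident and how much VRAM each reports using.
import time

import requests

while True:
    models = requests.get("http://localhost:11434/api/ps", timeout=10).json().get("models", [])
    loaded = [(m["name"], m.get("size_vram", 0) // 2**20) for m in models]
    print(time.strftime("%H:%M:%S"), "loaded (name, VRAM MiB):", loaded)
    time.sleep(5)
```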

In the evaluation_<datetime>.log files created by the logging package, several questions and answers are registered, but the log ends like this:

2024-10-05 13:39:17,404 - __main__ - INFO - Generating answers for model: phi3:14b-medium-128k-instruct-q5_K_M
2024-10-05 13:39:19,234 - httpx - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2024-10-05 13:39:19,235 - __main__ - INFO - Answered question: How does the king move in chess?
2024-10-05 13:39:19,235 - __main__ - INFO - Response: [in Spanish] In chess, the king can move one square in any direction: horizontally, vertically or diagonally. This means it has a total of 8 possible moves from its initial position (4 forward and backward and 4 to the sides). However, it is important to remember that the king's safety is paramount in the game, so its moves are often limited by enemy pieces.
2024-10-05 13:39:19,235 - __main__ - INFO - Generating answers for model: gemma2:27b-instruct-q5_K_M

One model is generating the answers; then the script switches to another one and the httpx request is never sent (or at least never registered in the log).

I have checked the OS logs using journalctl, but the last line is Ollama loading a model, and then the crash. Is there any additional information or test that might help diagnose the root cause?

This is the doom_ollama.py script:

import os
import logging
from datetime import datetime
from typing import List
import itertools
import time
import random
import tempfile

from llama_index.llms.ollama import Ollama
from llama_index.core import PromptTemplate, Document
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Embedded hyperparameters
hyperparameters = {
    "document_parsers": {
        "pdf_reader": "PDFReader"
    },
    "node_parser": {
        "type": "SentenceSplitter",
        "chunk_size": 512,
        "chunk_overlap": 0
    },
    "embedding_model": {
        "name": "BAAI/bge-m3",
        "dimension": 1024
    },
    "vector_store": {
        "type": "MilvusVectorStore",
        "uri": "local_milvus_lite.db",
        "overwrite": True
    },
    "llm_models": [
        "hermes3:8b-llama3.1-q8_0",
        "mistral-nemo:12b-instruct-2407-q8_0",
        "swaroopsittula/dragon-llama3.1:latest",
        "nemotron-mini:4b-instruct-q8_0",
        "supernova:latest",
        "phi3:14b-medium-128k-instruct-q5_K_M",
        "qwen2.5:32b",
        "mixtral:8x7b-instruct-v0.1-q3_K_M",
        #"gemma2:27b-instruct-q5_K_M",
        "llama3.2:latest",
        "mistral:7b-instruct-v0.3-q8_0",
        "llama3.1:8b-instruct-q8_0",
        ],
    "prompt_templates": ["You are an artificial intelligence expert in chess. You will answer questions from various players clearly and concisely, using a series of excerpts from the rules of chess as a reference.\n\nSpecifically, you will use the following fragments as context:\n\n{context_str}\n\nto answer the question asked by a player. \n\n{query_str}\n\nRemember to always provide all the information you have in a clear and precise manner. If you don't have information to answer the question, respond \"I don't have information to answer that question\"."
    ],
    "similarity_top_k": [4],
    "evaluation_metrics": [
        "AnswerRelevancyMetric",
        "FaithfulnessMetric"
    ],
    "grader": ["mistral-nemo:12b-instruct-2407-q8_0"],
    "temperature": [0]
}

# Set up logging
logs_dir = "logs"
os.makedirs(logs_dir, exist_ok=True)
current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
log_file_name = f"evaluation_{current_time}.log"
log_file_path = os.path.join(logs_dir, log_file_name)

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file_path),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)
logger.info(f"Logging to file: {log_file_path}")

# Fixed chess knowledge chunks
chess_knowledge = [
    "Chess is a two-player strategy game played on a board with 64 squares arranged in an 8x8 grid. Each player starts with 16 pieces: one king, one queen, two rooks, two knights, two bishops, and eight pawns. The objective is to checkmate the opponent's king.",
    "In chess, pieces move in specific ways. The king can move one square in any direction. The queen can move any number of squares in any direction. Rooks move horizontally or vertically, bishops move diagonally, and knights move in an L-shape. Pawns typically move forward one square at a time.",
    "Special chess rules include castling, where the king and rook can move simultaneously under certain conditions, and en passant, a special pawn capture. Pawns can also be promoted to any other piece (except a king) upon reaching the opposite end of the board."
]

# Chess questions
chess_questions = [
    "How does the knight move in chess?",
    "What is the objective of a chess game?",
    "How many squares are on a standard chess board?",
    "What is castling in chess?",
    "How does the queen move in chess?",
    "What is en passant in chess?",
    "How many pieces does each player start with in chess?",
    "How do pawns capture other pieces in chess?",
    "What happens when a pawn reaches the opposite end of the board?",
    "How does the king move in chess?"
]

def setup_milvus_lite():
    # Create a temporary directory for Milvus Lite
    temp_dir = tempfile.mkdtemp(prefix="milvus_lite_")
    
    # Set up Milvus Lite vector store
    vector_store = MilvusVectorStore(
        dim=hyperparameters['embedding_model']['dimension'],
        db_name="chess_knowledge",
        collection_name="chess_rules",
        uri="temp_milvus.db",
        overwrite=True
    )
    
    # Set up embedding model
    embed_model = HuggingFaceEmbedding(model_name=hyperparameters['embedding_model']['name'])
    
    # Create storage context and index
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    nodes = [Document(text=chunk) for chunk in chess_knowledge]
    index = VectorStoreIndex(nodes, storage_context=storage_context, embed_model=embed_model)
    
    return index, temp_dir

def get_ollama_response(model: str, prompt: str, question: str, index: VectorStoreIndex):
    llm = Ollama(model=model, request_timeout=60, temperature=hyperparameters['temperature'][0])
    qa_prompt = PromptTemplate(prompt)
    
    retriever = VectorIndexRetriever(index=index, similarity_top_k=hyperparameters['similarity_top_k'][0])
    response_synthesizer = get_response_synthesizer(llm=llm, text_qa_template=qa_prompt, response_mode="compact")
    query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=response_synthesizer)
    
    response = query_engine.query(question)
    return response.response

# Question answering with Ollama loop

llm_models = hyperparameters['llm_models']
prompt_templates = hyperparameters['prompt_templates']

index, temp_dir = setup_milvus_lite()

for model in llm_models:
    for prompt in prompt_templates:
        for question in chess_questions:
            logger.info(f"Generating answers for model: {model}")
            response = get_ollama_response(model, prompt, question, index)
            logger.info(f"Answered question: {question}")
            logger.info(f"Response: {response}")

ollama_error_log4.log
ollama_error_log3.log
ollama_error_log2.log
ollama_error_log1.log

evaluation_20241006_105957.log
evaluation_20241006_092659.log
evaluation_20241005_221027.log
evaluation_20241005_133747.log


@dhiltgen commented on GitHub (Oct 17, 2024):

A crash of the whole computer may point to hardware problems or driver bugs. Now that you've got a repro scenario, let's see if we can get some details on that side.

The following may yield some details from the NVIDIA driver or Linux kernel about hangs/panics/etc. (start this before you begin your repro scenario, and hopefully just as the system hangs you'll see something interesting):

sudo dmesg -w

You can also set CUDA_ERROR_LEVEL=50 for the Ollama server; that may generate additional messages in the Ollama logs from CUDA library failures leading up to the hang/crash.
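Since the local console freezes along with the rest of the machine, it may also help to mirror the kernel log to disk with explicit syncs. A minimal sketch (run as root; the output file name is arbitrary):

```python
# Mirrors `dmesg -w` into a timestamped file, fsyncing each line so the last
# kernel/driver messages survive even if the machine hard-freezes.
import os
import subprocess
import time

proc = subprocess.Popen(["dmesg", "-w"], stdout=subprocess.PIPE, text=True)
with open("dmesg_capture.log", "a") as out:
    for line in proc.stdout:
        out.write(f"{time.strftime('%H:%M:%S')} {line}")
        out.flush()
        os.fsync(out.fileno())  # force the write to disk
```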
