[GH-ISSUE #6416] Computer crashes after switching several Ollama models in a relatively short amount of time #50544

Closed
opened 2026-04-28 16:18:46 -05:00 by GiteaMirror · 7 comments

Originally created by @elsatch on GitHub (Aug 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6416

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I love running tests to compare outputs from different models. To do so, I've used tools like promptfoo or langfuse (on top of Haystack or Langchain). In these tools, you set a list of models and the program calls Ollama to load them one after the other. I am using a Linux computer with Ubuntu 22.04 and an RTX 3090.

After the program loads some of the big models (fp16, command-r or gemma2:27b), my computer freezes while it is loading the next model. I lose all access to the machine over ssh and it no longer responds. To recover, I have to hard-reset it by pressing the physical power button.

This is a sample run where it crashes loading mistral-nemo:12b (though it has happened with other models too, so this might not be model-specific):

Cache is disabled.
Providers are running in serial with user input.
Running 1 evaluations for provider ollama:chat:command-r:latest with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:command-r:latest "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:gemma2:27b-instruct-q5_K_M with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:gemma2:27b-instruct-q5_K_M "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:gemma2:2b-instruct-q8_0 with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:gemma2:2b-instruct-q8_0 "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:gemma2:9b with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:gemma2:9b "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:llama3:8b-instruct-fp16 with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:llama3:8b-instruct-fp16 "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:mayflowergmbh/occiglot-7b-eu5-instruct:latest with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:mayflowergmbh/occiglot-7b-eu5-instruct:latest "Eres un as" lista_ingr

Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:mistral-nemo:12b-instruct-2407-q8_0 with concurrency=4...
[░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0% | ETA: 0s | 0/1 | ""

Mistral-Nemo never loads successfully and the computer becomes unresponsive. The local console is also frozen (I have Psensor displaying real-time metrics and... they look quite normal, and frozen too).

Using other backends besides Ollama has not resulted in such crashes, but they are less convenient to use, so I'd love to debug this and find out what's making the OS crash.
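For context, the pattern these eval tools follow is just a sequential loop of chat calls that forces Ollama to swap models. A minimal sketch of that loop (the model list, prompt, and timeout below are illustrative, not taken from the promptfoo config; the `/api/chat` endpoint and payload shape follow the Ollama REST API):

```python
# Minimal sketch of the model-switching pattern that precedes the freeze.
import requests

models = [
    "command-r:latest",
    "gemma2:27b-instruct-q5_K_M",
    "llama3:8b-instruct-fp16",
    "mistral-nemo:12b-instruct-2407-q8_0",
]

for model in models:
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": "Say hello."}],
            "stream": False,
        },
        timeout=600,  # large models can take a while to load
    )
    resp.raise_for_status()
    print(model, "->", resp.json()["message"]["content"][:60])
```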

Checking the logs, I see no message related to the crash or to the loading of the new model:

NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.90.07  Fri May 31 09:35:42 UTC 2024
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model ftype      = Q4_0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model params     = 7.24 B
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: general.name     = mayflowergmbh
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: BOS token        = 1 '<s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOS token        = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: UNK token        = 0 '<unk>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: PAD token        = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: LF token         = 13 '<0x0A>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOT token        = 32001 '<|im_end|>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: max token length = 48
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: found 1 CUDA devices:
ago 19 10:47:19 ananke ollama[2292]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
ago 19 10:47:20 ananke ollama[2292]: llm_load_tensors: ggml ctx size =    0.27 MiB
ago 19 10:47:20 ananke ollama[2292]: time=2024-08-19T10:47:20.104+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading 32 repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading non-repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloaded 33/33 layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors:        CPU buffer size =    70.32 MiB
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors:      CUDA0 buffer size =  3847.56 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ctx      = 65536
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_batch    = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ubatch   = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: flash_attn = 0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_base  = 10000.0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_scale = 1
ago 19 10:47:23 ananke ollama[2292]: llama_kv_cache_init:      CUDA0 KV buffer size =  8192.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: KV self size  = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.55 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:      CUDA0 compute buffer size =  4256.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:  CUDA_Host compute buffer size =   136.01 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph nodes  = 1030
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph splits = 2
ago 19 10:47:23 ananke ollama[8085]: INFO [main] model loaded | tid="130282049744896" timestamp=1724057243
ago 19 10:47:23 ananke ollama[2292]: time=2024-08-19T10:47:23.622+02:00 level=INFO source=server.go:632 msg="llama runner started in 3.77 seconds"
ago 19 10:47:26 ananke ollama[2292]: [GIN] 2024/08/19 - 10:47:26 | 200 |  6.932276921s |       127.0.0.1 | POST     "/api/chat"
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: f_logit_scale    = 0.0e+00
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_ff             = 14336
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_expert         = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_expert_used    = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: causal attn      = 1
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: pooling type     = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: rope type        = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: rope scaling     = linear
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: freq_base_train  = 10000.0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: freq_scale_train = 1
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_ctx_orig_yarn  = 32768
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: rope_finetuned   = unknown
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_d_conv       = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_d_inner      = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_d_state      = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_dt_rank      = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model type       = 7B
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model ftype      = Q4_0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model params     = 7.24 B
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: general.name     = mayflowergmbh
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: BOS token        = 1 '<s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOS token        = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: UNK token        = 0 '<unk>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: PAD token        = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: LF token         = 13 '<0x0A>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOT token        = 32001 '<|im_end|>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: max token length = 48
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: found 1 CUDA devices:
ago 19 10:47:19 ananke ollama[2292]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
ago 19 10:47:20 ananke ollama[2292]: llm_load_tensors: ggml ctx size =    0.27 MiB
ago 19 10:47:20 ananke ollama[2292]: time=2024-08-19T10:47:20.104+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading 32 repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading non-repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloaded 33/33 layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors:        CPU buffer size =    70.32 MiB
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors:      CUDA0 buffer size =  3847.56 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ctx      = 65536
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_batch    = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ubatch   = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: flash_attn = 0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_base  = 10000.0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_scale = 1
ago 19 10:47:23 ananke ollama[2292]: llama_kv_cache_init:      CUDA0 KV buffer size =  8192.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: KV self size  = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.55 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:      CUDA0 compute buffer size =  4256.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model:  CUDA_Host compute buffer size =   136.01 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph nodes  = 1030
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph splits = 2
ago 19 10:47:23 ananke ollama[8085]: INFO [main] model loaded | tid="130282049744896" timestamp=1724057243
ago 19 10:47:23 ananke ollama[2292]: time=2024-08-19T10:47:23.622+02:00 level=INFO source=server.go:632 msg="llama runner started in 3.77 seconds"
ago 19 10:47:26 ananke ollama[2292]: [GIN] 2024/08/19 - 10:47:26 | 200 |  6.932276921s |       127.0.0.1 | POST     "/api/chat"
ago 19 10:47:27 ananke ollama[2292]: time=2024-08-19T10:47:27.959+02:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-16d65981-a438-ac56-8bab-f1393824041b library=cuda total="23.7 GiB" available="6.9 GiB"

Any help to troubleshoot the issue will be greatly appreciated!

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.6

GiteaMirror added the "needs more info", "bug", "nvidia", "linux" labels 2026-04-28 16:18:48 -05:00

@MaxJa4 commented on GitHub (Aug 19, 2024):

Freezing of the entire system sounds a lot like RAM overflowing. Since you use large models such as Command R and Mistral Nemo, they will spill over from GPU to CPU (RAM) with your GPU model (which is fine). But since you also seem to use huge context sizes (64k), you may run out of RAM, or get too close to it. Have you monitored RAM usage on the affected system? Do you have swap enabled and set to a reasonable size (e.g. 8 GB) so memory can spill over instead of crashing your system?

I frequently use Open WebUI with 3-10 models at a time (sequentially) to compare them, which causes lots of reloads; I've never had any issues there.
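One quick way to test this hypothesis is to log RAM and swap continuously while the eval runs. A minimal sketch (assuming psutil is installed; the log file name is arbitrary):

```python
# Logs RAM and swap usage once per second; the last line written before a
# freeze shows how close the system was to exhausting memory.
import time

import psutil

with open("mem_watch.log", "a", buffering=1) as f:  # line-buffered
    while True:
        vm = psutil.virtual_memory()
        sw = psutil.swap_memory()
        f.write(f"{time.strftime('%H:%M:%S')} ram={vm.percent}% swap={sw.percent}%\n")
        time.sleep(1)
```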


@rick-github commented on GitHub (Aug 19, 2024):

The log appears incomplete; there should be lines like `time=2024-08-19T10:30:24.515Z level=INFO source=memory.go:309 msg="offload to cuda"` which would aid in investigating what's happening. Can you post the full log?

Based on `llm_load_tensors: offloaded 33/33 layers to GPU`, it would appear that mayflowergmbh was fully loaded onto the GPU. mistral-nemo:12b-instruct-2407-q8_0 is larger at 41 layers, which may or may not fit in the 24 GB of an RTX 3090 if other models are resident. Without the full log, it's hard to say.
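As a rough pre-flight check, each model's reported size and quantization can be queried from Ollama before a run. A hedged sketch using the `/api/show` endpoint (the model names are the ones from this thread):

```python
# Queries Ollama's /api/show endpoint for each model's reported parameter
# size and quantization level, as a rough VRAM-fit sanity check.
import requests

for name in [
    "mayflowergmbh/occiglot-7b-eu5-instruct:latest",
    "mistral-nemo:12b-instruct-2407-q8_0",
]:
    info = requests.post(
        "http://localhost:11434/api/show", json={"name": name}, timeout=30
    ).json()
    details = info.get("details", {})
    print(name, details.get("parameter_size"), details.get("quantization_level"))
```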


@elsatch commented on GitHub (Aug 19, 2024):

Thanks @MaxJa4 for the idea of monitoring the RAM. I always choose models that fit into the VRAM of the card, but given the context sizes, some of the load might be offloading to RAM. I have 64 GB of RAM in this machine and the texts I am using are not too big, but with several models... I'll check this too.

Regarding the full log @rick-github, do you mean the full Ollama service log? I've attached the one corresponding to the previous run.
ollama_error_log.txt

The good news is that I think it was related to some kind of driver mismatch. I have just updated all packages on my machine (using the Lambda Stack repos to ensure everything is consistent), removed all unused packages, and was able to run a 20+ model test without any crash. I still need to check the logs to pinpoint which combinations worked and which didn't.


@viba1 commented on GitHub (Aug 30, 2024):

Hi!
I wanted to check if you have any updates regarding this bug. Has it been resolved, or is it still ongoing?
Thank you!


@dhiltgen commented on GitHub (Sep 3, 2024):

@elsatch happy to hear the driver version cleanup appears to have solved it. I'll close this one for now, but if you are able to reproduce it, here are a few things that may help narrow it down:

  • Run ollama serve with OLLAMA_DEBUG=1 set so we get more verbose logging (see the sketch after this list)
  • Check your OS level logs for signs of driver hangs, etc.
  • Keep an eye on memory usage, and paging (if you have swap enabled)

If it does happen again, let us know and I'll re-open the issue.
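For the first point, a minimal sketch (not from this thread) of launching a verbose foreground server from Python and capturing its output to a file:

```python
# Launches `ollama serve` with OLLAMA_DEBUG=1 and captures its output.
# Stop the system service first (e.g. `systemctl stop ollama`) so the port is free.
import os
import subprocess

env = dict(os.environ, OLLAMA_DEBUG="1")
with open("ollama_debug.log", "wb") as log:
    subprocess.run(["ollama", "serve"], env=env, stdout=log, stderr=subprocess.STDOUT)
```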


@elsatch commented on GitHub (Oct 7, 2024):

Hi @dhiltgen,

It has taken a while but... now I am able to reproduce it! I have produced a script that simulates my RAG evaluations. It generates responses for ten chess-related questions, passing three chunks as context (the chunks are not retrieved in relation to the question; they mostly add around 300 tokens of context for answer generation). The script generates the ten answers using one model, then loads the next; answers are generated sequentially, one after the other.

Whenever I run this doom_ollama.py script, my system comes to an absolute halt. I have made the system crash loading different combinations of models, so it's not model-specific. This is my list of models:

llm_models": [
"hermes3:8b-llama3.1-q8_0",
"mistral-nemo:12b-instruct-2407-q8_0",
"swaroopsittula/dragon-llama3.1:latest",
"nemotron-mini:4b-instruct-q8_0",
"supernova:latest",
"phi3:14b-medium-128k-instruct-q5_K_M",
"qwen2.5:32b",
"mixtral:8x7b-instruct-v0.1-q3_K_M",
"gemma2:27b-instruct-q5_K_M",
"llama3.2:latest",
"mistral:7b-instruct-v0.3-q8_0",
"llama3.1:8b-instruct-q8_0",
],

I have used the ollama ps command and, apparently, only one model at a time is loaded on my GPU. Some models like Gemma consume part GPU VRAM, part CPU RAM. But as one model gets loaded, the previous model no longer appears in the ollama ps output.
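A small poller can record the same information as `ollama ps` over time, so the residency state at the moment of the freeze is preserved. A hedged sketch against the `/api/ps` endpoint behind that command:

```python
# Polls Ollama's /api/ps endpoint every few seconds and prints which models
# are resident and how much VRAM each reports using.
import time

import requests

while True:
    models = requests.get("http://localhost:11434/api/ps", timeout=10).json().get("models", [])
    loaded = [(m["name"], m.get("size_vram", 0) // 2**20) for m in models]
    print(time.strftime("%H:%M:%S"), "loaded (name, VRAM MiB):", loaded)
    time.sleep(5)
```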

In the evaluation_<datetime>.log files created by the logging package, several questions and answers are registered, but the log ends like this:

2024-10-05 13:39:17,404 - __main__ - INFO - Generating answers for model: phi3:14b-medium-128k-instruct-q5_K_M
2024-10-05 13:39:19,234 - httpx - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2024-10-05 13:39:19,235 - __main__ - INFO - Answered question: How does the king move in chess?
2024-10-05 13:39:19,235 - __main__ - INFO - Response: [in Spanish] In chess, the king can move one square in any direction: horizontally, vertically or diagonally. This means it has a total of 8 possible moves from its initial position (4 forward and backward and 4 to the sides). However, it is important to remember that the king's safety is paramount in the game, so its moves are often limited by enemy pieces.
2024-10-05 13:39:19,235 - __main__ - INFO - Generating answers for model: gemma2:27b-instruct-q5_K_M

One model is generating the answers; then the script switches to another one and the httpx request is never sent (or at least never registered in the log).

I have checked the OS logs using journalctl, but the last line is Ollama loading a model, and then the crash. Is there any additional information or test that might help diagnose the root cause?

This is the doom_ollama.py script:

import os
import logging
from datetime import datetime
from typing import List
import itertools
import time
import random
import tempfile

from llama_index.llms.ollama import Ollama
from llama_index.core import PromptTemplate, Document
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Embedded hyperparameters
hyperparameters = {
    "document_parsers": {
        "pdf_reader": "PDFReader"
    },
    "node_parser": {
        "type": "SentenceSplitter",
        "chunk_size": 512,
        "chunk_overlap": 0
    },
    "embedding_model": {
        "name": "BAAI/bge-m3",
        "dimension": 1024
    },
    "vector_store": {
        "type": "MilvusVectorStore",
        "uri": "local_milvus_lite.db",
        "overwrite": True
    },
    "llm_models": [
        "hermes3:8b-llama3.1-q8_0",
        "mistral-nemo:12b-instruct-2407-q8_0",
        "swaroopsittula/dragon-llama3.1:latest",
        "nemotron-mini:4b-instruct-q8_0",
        "supernova:latest",
        "phi3:14b-medium-128k-instruct-q5_K_M",
        "qwen2.5:32b",
        "mixtral:8x7b-instruct-v0.1-q3_K_M",
        #"gemma2:27b-instruct-q5_K_M",
        "llama3.2:latest",
        "mistral:7b-instruct-v0.3-q8_0",
        "llama3.1:8b-instruct-q8_0",
        ],
    "prompt_templates": ["You are an artificial intelligence expert in chess. You will answer questions from various players clearly and concisely, using a series of excerpts from the rules of chess as a reference.\n\nSpecifically, you will use the following fragments as context:\n\n{context_str}\n\nto answer the question asked by a player. \n\n{query_str}\n\nRemember to always provide all the information you have in a clear and precise manner. If you don't have information to answer the question, respond \"I don't have information to answer that question\"."
    ],
    "similarity_top_k": [4],
    "evaluation_metrics": [
        "AnswerRelevancyMetric",
        "FaithfulnessMetric"
    ],
    "grader": ["mistral-nemo:12b-instruct-2407-q8_0"],
    "temperature": [0]
}

# Set up logging
logs_dir = "logs"
os.makedirs(logs_dir, exist_ok=True)
current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
log_file_name = f"evaluation_{current_time}.log"
log_file_path = os.path.join(logs_dir, log_file_name)

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file_path),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)
logger.info(f"Logging to file: {log_file_path}")

# Fixed chess knowledge chunks
chess_knowledge = [
    "Chess is a two-player strategy game played on a board with 64 squares arranged in an 8x8 grid. Each player starts with 16 pieces: one king, one queen, two rooks, two knights, two bishops, and eight pawns. The objective is to checkmate the opponent's king.",
    "In chess, pieces move in specific ways. The king can move one square in any direction. The queen can move any number of squares in any direction. Rooks move horizontally or vertically, bishops move diagonally, and knights move in an L-shape. Pawns typically move forward one square at a time.",
    "Special chess rules include castling, where the king and rook can move simultaneously under certain conditions, and en passant, a special pawn capture. Pawns can also be promoted to any other piece (except a king) upon reaching the opposite end of the board."
]

# Chess questions
chess_questions = [
    "How does the knight move in chess?",
    "What is the objective of a chess game?",
    "How many squares are on a standard chess board?",
    "What is castling in chess?",
    "How does the queen move in chess?",
    "What is en passant in chess?",
    "How many pieces does each player start with in chess?",
    "How do pawns capture other pieces in chess?",
    "What happens when a pawn reaches the opposite end of the board?",
    "How does the king move in chess?"
]

def setup_milvus_lite():
    # Create a temporary directory for Milvus Lite
    temp_dir = tempfile.mkdtemp(prefix="milvus_lite_")
    
    # Set up Milvus Lite vector store
    vector_store = MilvusVectorStore(
        dim=hyperparameters['embedding_model']['dimension'],
        db_name="chess_knowledge",
        collection_name="chess_rules",
        uri="temp_milvus.db",
        overwrite=True
    )
    
    # Set up embedding model
    embed_model = HuggingFaceEmbedding(model_name=hyperparameters['embedding_model']['name'])
    
    # Create storage context and index
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    nodes = [Document(text=chunk) for chunk in chess_knowledge]
    index = VectorStoreIndex(nodes, storage_context=storage_context, embed_model=embed_model)
    
    return index, temp_dir

def get_ollama_response(model: str, prompt: str, question: str, index: VectorStoreIndex):
    llm = Ollama(model=model, request_timeout=60, temperature=hyperparameters['temperature'][0])
    qa_prompt = PromptTemplate(prompt)
    
    retriever = VectorIndexRetriever(index=index, similarity_top_k=hyperparameters['similarity_top_k'][0])
    response_synthesizer = get_response_synthesizer(llm=llm, text_qa_template=qa_prompt, response_mode="compact")
    query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=response_synthesizer)
    
    response = query_engine.query(question)
    return response.response

# Question answering with Ollama loop

llm_models = hyperparameters['llm_models']
prompt_templates = hyperparameters['prompt_templates']

index, temp_dir = setup_milvus_lite()

for model in llm_models:
    for prompt in prompt_templates:
        for question in chess_questions:
            logger.info(f"Generating answers for model: {model}")
            response = get_ollama_response(model, prompt, question, index)
            logger.info(f"Answered question: {question}")
            logger.info(f"Response: {response}")

ollama_error_log4.log
ollama_error_log3.log
ollama_error_log2.log
ollama_error_log1.log

evaluation_20241006_105957.log
evaluation_20241006_092659.log
evaluation_20241005_221027.log
evaluation_20241005_133747.log


@dhiltgen commented on GitHub (Oct 17, 2024):

A crash of the whole computer may point to hardware problems or driver bugs. Now that you've got a repro scenario, let's see if we can get some details on that side.

The following may yield some details from the NVIDIA driver or Linux kernel about hangs/panics/etc. (start this before you begin your repro scenario, and hopefully just as the system hangs you'll see something interesting):

sudo dmesg -w

You can also set CUDA_ERROR_LEVEL=50 for the Ollama server; that may generate additional messages in the Ollama logs from CUDA library failures leading up to the hang/crash.
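Since the local console freezes along with the rest of the machine, it may also help to mirror the kernel log to disk with explicit syncs. A minimal sketch (run as root; the output file name is arbitrary):

```python
# Mirrors `dmesg -w` into a timestamped file, fsyncing each line so the last
# kernel/driver messages survive even if the machine hard-freezes.
import os
import subprocess
import time

proc = subprocess.Popen(["dmesg", "-w"], stdout=subprocess.PIPE, text=True)
with open("dmesg_capture.log", "a") as out:
    for line in proc.stdout:
        out.write(f"{time.strftime('%H:%M:%S')} {line}")
        out.flush()
        os.fsync(out.fileno())  # force the write to disk
```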
