[GH-ISSUE #4427] ollama can't run qwen:72b, error msg "gpu VRAM usage didn't recover within timeout" #64802

Closed
opened 2026-05-03 18:48:38 -05:00 by GiteaMirror · 29 comments

Originally created by @changingshow on GitHub (May 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4427

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I have already downloaded qwen:7b, but when I run `ollama run qwen:7b` I get the error `Error: timed out waiting for llama runner to start:`, and server.log contains the message `gpu VRAM usage didn't recover within timeout`.

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.1.37
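
For reference, on a Windows install the server log quoted above typically lives under the local app data folder; a hedged pointer, since the exact path can vary by version:

type "%LOCALAPPDATA%\Ollama\server.log"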

GiteaMirror added the bug label 2026-05-03 18:48:38 -05:00

@chrisoutwright commented on GitHub (May 19, 2024):

I also get it (with smaller-parameter models) when running an RTX 2080 Ti or a GTX 1060 with codeqwen:chat and codegemma:instruct on Win10.
The model stays in system RAM but has to be copied to GPU RAM again after every chat POST.

GPU RAM is not being exceeded on either card, so I'm not sure why it times out every time.


@dhiltgen commented on GitHub (May 21, 2024):

This was fixed in 0.1.38 via pr #4430. If you're still seeing problems after upgrading, please share your server log and I'll reopen.


@pamanseau commented on GitHub (May 24, 2024):

@dhiltgen I have the same issue with 0.1.38 on Linux:
[ollama.log](https://github.com/ollama/ollama/files/15432100/ollama.log)


@dhiltgen commented on GitHub (May 25, 2024):

@pamanseau from the logs you shared, it looks like the client gave up before the model finished loading, and since the client request was canceled, we canceled the loading of the model. Are you using our CLI, or are you calling the API? If you're calling the API, what timeout are you setting in your client?

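A minimal sketch of ruling out client-side timeouts, assuming a default local install; the model name and the 10-minute cap are placeholders, not taken from this thread:

# allow the client to wait up to 10 minutes for the initial model load
curl --max-time 600 http://localhost:11434/api/generate -d '{"model": "qwen:7b", "prompt": "hi"}'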

@kirkster96 commented on GitHub (Jun 1, 2024):

@dhiltgen

I also have the same issue using ollama/ollama:latest on Docker.
I notice that this occurs after I successfully run a prompt and then let it idle.
When I come back, I have to wait for the `llama runner started in...` message.
[ollama.log](https://github.com/user-attachments/files/15522748/ollama.log)

Edit:

Turns out that I did not specify a keep-alive and the default is 5 minutes!
Never mind! 😁
@pamanseau @dhiltgen
[How do I keep a model loaded in memory or make it unload immediately?](https://github.com/ollama/ollama/blob/c02db93243353855b983db2a1562a02b57e66db1/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately)

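A hedged sketch of the keep-alive controls referenced in that FAQ entry; the model name and the 24h value are illustrative:

# keep the model loaded indefinitely after this request
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'
# unload the model immediately after this request
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": 0}'
# or set a server-wide default (assuming a build that reads OLLAMA_KEEP_ALIVE)
OLLAMA_KEEP_ALIVE=24h ollama serve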

@pamanseau commented on GitHub (Jun 3, 2024):

> @pamanseau from the logs you shared, it looks like the client gave up before the model finished loading, and since the client request was canceled, we canceled the loading of the model. Are you using our CLI, or are you calling the API? If you're calling the API, what timeout are you setting in your client?

We don't set a specific timeout, whether using the API or the WebUI.
Perhaps the default timeout is too small.


@axel7083 commented on GitHub (Jun 4, 2024):

I am facing the same issue in version v0.1.41 when trying to set `PARAMETER num_gpu 33`. I got the following:

[GIN] 2024/06/04 - 11:07:42 | 200 |         3m23s |       10.88.0.1 | POST     "/v1/chat/completions"
time=2024-06-04T11:12:49.354Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=6.957608932
time=2024-06-04T11:12:51.595Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=9.198405481
time=2024-06-04T11:12:53.757Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=11.359973211

@dhiltgen commented on GitHub (Jun 4, 2024):

@axel7083 look earlier in your logs. My suspicion is the GPU runner crashed (perhaps 33 layers was too much for your GPU), we fell back to CPU, and there's a minor bug where we mistakenly try to verify VRAM recovery on unload even though the runner was CPU based.

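Following on from that, a hedged example of backing the layer count off per request when a full offload doesn't fit; the value 28 and the model name are placeholders:

# request fewer GPU layers via the options field instead of a Modelfile PARAMETER
curl http://localhost:11434/api/generate -d '{"model": "qwen:7b", "prompt": "hi", "options": {"num_gpu": 28}}'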

@dhiltgen commented on GitHub (Jun 4, 2024):

@pamanseau can you try to repro with the CLI to rule out client timeouts?


@abstract-entity commented on GitHub (Jun 5, 2024):

Hello, I got the same here with an RTX 4090 GPU:

2024-06-05 09:56:37 time=2024-06-05T07:56:37.849Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.10060701
2024-06-05 09:56:38 time=2024-06-05T07:56:38.099Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.350123004
2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.5 GiB" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.5 GiB" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=server.go:341 msg="starting llama server" cmd="/tmp/ollama2435727696/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 36319"
2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=sched.go:338 msg="loaded runners" count=1
2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
2024-06-05 09:56:38 time=2024-06-05T07:56:38.339Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
2024-06-05 09:56:38 time=2024-06-05T07:56:38.349Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.600819928
2024-06-05 09:56:38 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
2024-06-05 09:56:38 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-06-05 09:56:38 llama_model_loader: - kv 0: general.architecture str = llama
2024-06-05 09:56:32 [GIN] 2024/06/05 - 07:56:32 | 200 | 11.239µs | 172.18.0.3 | HEAD "/"
2024-06-05 09:56:38 INFO [main] build info | build=1 commit="5921b8f" tid="139796719271936" timestamp=1717574198
2024-06-05 09:56:38 INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139796719271936" timestamp=1717574198 total_threads=32
2024-06-05 09:56:38 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="36319" tid="139796719271936" timestamp=1717574198
2024-06-05 09:56:38 llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
2024-06-05 09:56:38 llama_model_loader: - kv 2: llama.block_count u32 = 32
2024-06-05 09:56:38 llama_model_loader: - kv 3: llama.context_length u32 = 8192
2024-06-05 09:56:38 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
2024-06-05 09:56:38 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
2024-06-05 09:56:38 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
2024-06-05 09:56:38 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
2024-06-05 09:56:38 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
2024-06-05 09:56:38 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
2024-06-05 09:56:38 llama_model_loader: - kv 10: general.file_type u32 = 2
2024-06-05 09:56:38 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
2024-06-05 09:56:38 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
2024-06-05 09:56:38 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
2024-06-05 09:56:38 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
2024-06-05 09:56:38 llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
2024-06-05 09:56:38 llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-06-05 09:56:38 llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
2024-06-05 09:56:38 llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
2024-06-05 09:56:38 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
2024-06-05 09:56:38 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
2024-06-05 09:56:38 llama_model_loader: - kv 21: general.quantization_version u32 = 2
2024-06-05 09:56:38 llama_model_loader: - type f32: 65 tensors
2024-06-05 09:56:38 llama_model_loader: - type q4_0: 225 tensors
2024-06-05 09:56:38 llama_model_loader: - type q6_K: 1 tensors
2024-06-05 09:56:38 time=2024-06-05T07:56:38.590Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
2024-06-05 09:56:38 llm_load_vocab: special tokens cache size = 256
2024-06-05 09:56:38 llm_load_vocab: token to piece cache size = 1.5928 MB
2024-06-05 09:56:38 llm_load_print_meta: format = GGUF V3 (latest)
2024-06-05 09:56:38 llm_load_print_meta: arch = llama
2024-06-05 09:56:38 llm_load_print_meta: vocab type = BPE
2024-06-05 09:56:38 llm_load_print_meta: n_vocab = 128256
2024-06-05 09:56:38 llm_load_print_meta: n_merges = 280147
2024-06-05 09:56:38 llm_load_print_meta: n_ctx_train = 8192
2024-06-05 09:56:38 llm_load_print_meta: n_embd = 4096
2024-06-05 09:56:38 llm_load_print_meta: n_head = 32
2024-06-05 09:56:38 llm_load_print_meta: n_head_kv = 8
2024-06-05 09:56:38 llm_load_print_meta: n_layer = 32
2024-06-05 09:56:38 llm_load_print_meta: n_rot = 128
2024-06-05 09:56:38 llm_load_print_meta: n_embd_head_k = 128
2024-06-05 09:56:38 llm_load_print_meta: n_embd_head_v = 128
2024-06-05 09:56:38 llm_load_print_meta: n_gqa = 4
2024-06-05 09:56:38 llm_load_print_meta: n_embd_k_gqa = 1024
2024-06-05 09:56:38 llm_load_print_meta: n_embd_v_gqa = 1024
2024-06-05 09:56:38 llm_load_print_meta: f_norm_eps = 0.0e+00
2024-06-05 09:56:38 llm_load_print_meta: f_norm_rms_eps = 1.0e-05
2024-06-05 09:56:38 llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-06-05 09:56:38 llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-06-05 09:56:38 llm_load_print_meta: f_logit_scale = 0.0e+00
2024-06-05 09:56:38 llm_load_print_meta: n_ff = 14336
2024-06-05 09:56:38 llm_load_print_meta: n_expert = 0
2024-06-05 09:56:38 llm_load_print_meta: n_expert_used = 0
2024-06-05 09:56:38 llm_load_print_meta: causal attn = 1
2024-06-05 09:56:38 llm_load_print_meta: pooling type = 0
2024-06-05 09:56:38 llm_load_print_meta: rope type = 0
2024-06-05 09:56:38 llm_load_print_meta: rope scaling = linear
2024-06-05 09:56:38 llm_load_print_meta: freq_base_train = 500000.0
2024-06-05 09:56:38 llm_load_print_meta: freq_scale_train = 1
2024-06-05 09:56:38 llm_load_print_meta: n_yarn_orig_ctx = 8192
2024-06-05 09:56:38 llm_load_print_meta: rope_finetuned = unknown
2024-06-05 09:56:38 llm_load_print_meta: ssm_d_conv = 0
2024-06-05 09:56:38 llm_load_print_meta: ssm_d_inner = 0
2024-06-05 09:56:38 llm_load_print_meta: ssm_d_state = 0
2024-06-05 09:56:38 llm_load_print_meta: ssm_dt_rank = 0
2024-06-05 09:56:38 llm_load_print_meta: model type = 8B
2024-06-05 09:56:38 llm_load_print_meta: model ftype = Q4_0
2024-06-05 09:56:38 llm_load_print_meta: model params = 8.03 B
2024-06-05 09:56:38 llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
2024-06-05 09:56:38 llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
2024-06-05 09:56:38 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
2024-06-05 09:56:38 llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
2024-06-05 09:56:38 llm_load_print_meta: LF token = 128 'Ä'
2024-06-05 09:56:38 llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
2024-06-05 09:56:38 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
2024-06-05 09:56:38 ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
2024-06-05 09:56:38 ggml_cuda_init: found 1 CUDA devices:
2024-06-05 09:56:38 Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
2024-06-05 09:56:38 llm_load_tensors: ggml ctx size = 0.30 MiB
2024-06-05 09:56:39 llm_load_tensors: offloading 32 repeating layers to GPU
2024-06-05 09:56:39 llm_load_tensors: offloading non-repeating layers to GPU
2024-06-05 09:56:39 llm_load_tensors: offloaded 33/33 layers to GPU
2024-06-05 09:56:39 llm_load_tensors: CPU buffer size = 281.81 MiB
2024-06-05 09:56:39 llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
2024-06-05 09:56:39 llama_new_context_with_model: n_ctx = 2048
2024-06-05 09:56:39 llama_new_context_with_model: n_batch = 512
2024-06-05 09:56:39 llama_new_context_with_model: n_ubatch = 512
2024-06-05 09:56:39 llama_new_context_with_model: flash_attn = 0
2024-06-05 09:56:39 llama_new_context_with_model: freq_base = 500000.0
2024-06-05 09:56:39 llama_new_context_with_model: freq_scale = 1
2024-06-05 09:56:39 llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB
2024-06-05 09:56:39 llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
2024-06-05 09:56:39 llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
2024-06-05 09:56:39 llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB
2024-06-05 09:56:39 llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
2024-06-05 09:56:39 llama_new_context_with_model: graph nodes = 1030
2024-06-05 09:56:39 llama_new_context_with_model: graph splits = 2
2024-06-05 09:56:40 time=2024-06-05T07:56:40.095Z level=INFO source=server.go:572 msg="llama runner started in 1.76 seconds"
2024-06-05 09:56:45 time=2024-06-05T07:56:45.546Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.076150155
2024-06-05 09:56:45 time=2024-06-05T07:56:45.796Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3259817720000004
2024-06-05 09:56:46 time=2024-06-05T07:56:46.042Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.5 GiB" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="22.5 GiB" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=server.go:341 msg="starting llama server" cmd="/tmp/ollama2435727696/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --mlock --parallel 1 --port 41889"
2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=sched.go:338 msg="loaded runners" count=1
2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
2024-06-05 09:56:46 time=2024-06-05T07:56:46.043Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
2024-06-05 09:56:46 time=2024-06-05T07:56:46.045Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.575500551
2024-06-05 09:56:46 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
2024-06-05 09:56:46 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-06-05 09:56:46 llama_model_loader: - kv 0: general.architecture str = llama
2024-06-05 09:56:46 llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
2024-06-05 09:56:46 llama_model_loader: - kv 2: llama.block_count u32 = 32
2024-06-05 09:56:46 llama_model_loader: - kv 3: llama.context_length u32 = 8192
2024-06-05 09:56:46 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
2024-06-05 09:56:46 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
2024-06-05 09:56:46 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
2024-06-05 09:56:46 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
2024-06-05 09:56:46 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
2024-06-05 09:56:46 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
2024-06-05 09:56:46 llama_model_loader: - kv 10: general.file_type u32 = 2
2024-06-05 09:56:46 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
2024-06-05 09:56:46 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
2024-06-05 09:56:46 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
2024-06-05 09:56:46 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
2024-06-05 09:56:46 llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
2024-06-05 09:56:46 llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-06-05 09:56:46 llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
2024-06-05 09:56:46 llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
2024-06-05 09:56:46 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
2024-06-05 09:56:46 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
2024-06-05 09:56:46 llama_model_loader: - kv 21: general.quantization_version u32 = 2
2024-06-05 09:56:46 llama_model_loader: - type f32: 65 tensors
2024-06-05 09:56:39 INFO [main] model loaded | tid="139796719271936" timestamp=1717574199
2024-06-05 09:56:40 [GIN] 2024/06/05 - 07:56:40 | 200 | 7.453372923s | 172.18.0.3 | POST "/api/embeddings"
2024-06-05 09:56:46 INFO [main] build info | build=1 commit="5921b8f" tid="140222294765568" timestamp=1717574206
2024-06-05 09:56:46 INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140222294765568" timestamp=1717574206 total_threads=32
2024-06-05 09:56:46 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="41889" tid="140222294765568" timestamp=1717574206
2024-06-05 09:56:47 INFO [main] model loaded | tid="140222294765568" timestamp=1717574207
2024-06-05 09:56:50 [GIN] 2024/06/05 - 07:56:50 | 200 | 10.221225661s | 172.18.0.3 | POST "/api/chat"
2024-06-05 09:56:46 llama_model_loader: - type q4_0: 225 tensors
2024-06-05 09:56:46 llama_model_loader: - type q6_K: 1 tensors
2024-06-05 09:56:46 time=2024-06-05T07:56:46.294Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
2024-06-05 09:56:46 llm_load_vocab: special tokens cache size = 256
2024-06-05 09:56:46 llm_load_vocab: token to piece cache size = 1.5928 MB
2024-06-05 09:56:46 llm_load_print_meta: format = GGUF V3 (latest)
2024-06-05 09:56:46 llm_load_print_meta: arch = llama
2024-06-05 09:56:46 llm_load_print_meta: vocab type = BPE
2024-06-05 09:56:46 llm_load_print_meta: n_vocab = 128256
2024-06-05 09:56:46 llm_load_print_meta: n_merges = 280147
2024-06-05 09:56:46 llm_load_print_meta: n_ctx_train = 8192
2024-06-05 09:56:46 llm_load_print_meta: n_embd = 4096
2024-06-05 09:56:46 llm_load_print_meta: n_head = 32
2024-06-05 09:56:46 llm_load_print_meta: n_head_kv = 8
2024-06-05 09:56:46 llm_load_print_meta: n_layer = 32
2024-06-05 09:56:46 llm_load_print_meta: n_rot = 128
2024-06-05 09:56:46 llm_load_print_meta: n_embd_head_k = 128
2024-06-05 09:56:46 llm_load_print_meta: n_embd_head_v = 128
2024-06-05 09:56:46 llm_load_print_meta: n_gqa = 4
2024-06-05 09:56:46 llm_load_print_meta: n_embd_k_gqa = 1024
2024-06-05 09:56:46 llm_load_print_meta: n_embd_v_gqa = 1024
2024-06-05 09:56:46 llm_load_print_meta: f_norm_eps = 0.0e+00
2024-06-05 09:56:46 llm_load_print_meta: f_norm_rms_eps = 1.0e-05
2024-06-05 09:56:46 llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-06-05 09:56:46 llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-06-05 09:56:46 llm_load_print_meta: f_logit_scale = 0.0e+00
2024-06-05 09:56:46 llm_load_print_meta: n_ff = 14336
2024-06-05 09:56:46 llm_load_print_meta: n_expert = 0
2024-06-05 09:56:46 llm_load_print_meta: n_expert_used = 0
2024-06-05 09:56:46 llm_load_print_meta: causal attn = 1
2024-06-05 09:56:46 llm_load_print_meta: pooling type = 0
2024-06-05 09:56:46 llm_load_print_meta: rope type = 0
2024-06-05 09:56:46 llm_load_print_meta: rope scaling = linear
2024-06-05 09:56:46 llm_load_print_meta: freq_base_train = 500000.0
2024-06-05 09:56:46 llm_load_print_meta: freq_scale_train = 1
2024-06-05 09:56:46 llm_load_print_meta: n_yarn_orig_ctx = 8192
2024-06-05 09:56:46 llm_load_print_meta: rope_finetuned = unknown
2024-06-05 09:56:46 llm_load_print_meta: ssm_d_conv = 0
2024-06-05 09:56:46 llm_load_print_meta: ssm_d_inner = 0
2024-06-05 09:56:46 llm_load_print_meta: ssm_d_state = 0
2024-06-05 09:56:46 llm_load_print_meta: ssm_dt_rank = 0
2024-06-05 09:56:46 llm_load_print_meta: model type = 8B
2024-06-05 09:56:46 llm_load_print_meta: model ftype = Q4_0
2024-06-05 09:56:46 llm_load_print_meta: model params = 8.03 B
2024-06-05 09:56:46 llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
2024-06-05 09:56:46 llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
2024-06-05 09:56:46 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
2024-06-05 09:56:46 llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
2024-06-05 09:56:46 llm_load_print_meta: LF token = 128 'Ä'
2024-06-05 09:56:46 llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
2024-06-05 09:56:46 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
2024-06-05 09:56:46 ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
2024-06-05 09:56:46 ggml_cuda_init: found 1 CUDA devices:
2024-06-05 09:56:46 Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
2024-06-05 09:56:46 llm_load_tensors: ggml ctx size = 0.30 MiB
2024-06-05 09:56:46 llm_load_tensors: offloading 32 repeating layers to GPU
2024-06-05 09:56:46 llm_load_tensors: offloading non-repeating layers to GPU
2024-06-05 09:56:46 llm_load_tensors: offloaded 33/33 layers to GPU
2024-06-05 09:56:46 llm_load_tensors: CPU buffer size = 281.81 MiB
2024-06-05 09:56:46 llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
2024-06-05 09:56:47 llama_new_context_with_model: n_ctx = 2048
2024-06-05 09:56:47 llama_new_context_with_model: n_batch = 512
2024-06-05 09:56:47 llama_new_context_with_model: n_ubatch = 512
2024-06-05 09:56:47 llama_new_context_with_model: flash_attn = 0
2024-06-05 09:56:47 llama_new_context_with_model: freq_base = 500000.0
2024-06-05 09:56:47 llama_new_context_with_model: freq_scale = 1
2024-06-05 09:56:47 llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB
2024-06-05 09:56:47 llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
2024-06-05 09:56:47 llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
2024-06-05 09:56:47 llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB
2024-06-05 09:56:47 llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
2024-06-05 09:56:47 llama_new_context_with_model: graph nodes = 1030
2024-06-05 09:56:47 llama_new_context_with_model: graph splits = 2
2024-06-05 09:56:47 time=2024-06-05T07:56:47.798Z level=INFO source=server.go:572 msg="llama runner started in 1.76 seconds"


@om35 commented on GitHub (Jun 5, 2024):

Hello, same issue on 0.1.41:

/tmp/ollama1286469287/runners/cuda_v11/ollama_llama_server: error while loading shared libraries: /usr/local/cuda/lib64/libcudart.so.11.0: file too short
time=2024-06-05T09:34:25.172Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
time=2024-06-05T09:34:25.422Z level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 127 "
[GIN] 2024/06/05 - 09:34:25 | 500 |  498.164855ms |       127.0.0.1 | POST     "/api/chat"
time=2024-06-05T09:34:30.487Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.06438619
time=2024-06-05T09:34:30.737Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.314507407
time=2024-06-05T09:34:30.987Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.564456516

ollama run mixtral:8x7b
Error: llama runner process has terminated: exit status 127


@dhiltgen commented on GitHub (Jun 6, 2024):

@om35 `/usr/local/cuda/lib64/libcudart.so.11.0: file too short` sounds like your local CUDA install might be corrupt.

@abstract-entity it's a little hard to tell, but it seems like the subprocess is crashing. Can you try with debug enabled so we can see a little more detail?

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

then trigger a model load, and assuming it crashes, share that server.log.

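For installs managed by systemd (as several commenters here have), a rough equivalent for capturing debug output, assuming the service is named ollama:

sudo systemctl edit ollama      # add a drop-in with:  [Service]  Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f | tee server.log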

@pamanseau commented on GitHub (Jun 7, 2024):

What I found out is that the NGINX ingress is causing this disconnection from the API, so, as you mentioned, Ollama stopped loading the model and that caused this error.

If I connect the WebUI directly to the ClusterIP or NodePort, then it works.

The alternative is configuring the NGINX keepalive timeout via the ConfigMap, but that applies to the whole cluster, not just the Ollama ingress, so it has implications for the other services in the cluster.

The Helm chart deploys an Ingress, but it should look at the Gateway API or find another way to keep the client connected.

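If the controller is ingress-nginx, per-Ingress proxy timeouts can usually be raised with annotations rather than the cluster-wide ConfigMap; a hedged sketch where the ingress name and the 600-second value are placeholders:

kubectl annotate ingress ollama \
  nginx.ingress.kubernetes.io/proxy-read-timeout=600 \
  nginx.ingress.kubernetes.io/proxy-send-timeout=600 --overwrite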

@ghost commented on GitHub (Jun 16, 2024):

Hi, this doesn't happen to me when running ollama as root directly in a shell, but it happens when I start ollama as a service (regardless of the user):

amnesia λ ~/ sudo systemctl status ollama 
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
     Active: active (running) since Sun 2024-06-16 16:47:04 PDT; 39s ago
   Main PID: 7273 (ollama)
      Tasks: 18 (limit: 38365)
     Memory: 561.0M (peak: 615.8M)
        CPU: 5.644s
     CGroup: /system.slice/ollama.service
             └─7273 /usr/local/bin/ollama serve

Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.252-07:00 level=INFO source=amd_linux.go:304 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION="\"10.3.0\""
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.506-07:00 level=WARN source=amd_linux.go:48 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sy>
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.511-07:00 level=INFO source=amd_linux.go:304 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION="\"10.3.0\""
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.714-07:00 level=WARN source=sched.go:511 msg="gpu VRAM usage didn't recover within timeout" seconds=5.045401852
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.749-07:00 level=WARN source=amd_linux.go:48 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sy>
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.753-07:00 level=INFO source=amd_linux.go:304 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION="\"10.3.0\""
Jun 16 16:47:37 dead ollama[7273]: time=2024-06-16T16:47:37.964-07:00 level=WARN source=sched.go:511 msg="gpu VRAM usage didn't recover within timeout" seconds=5.295260429
Jun 16 16:47:38 dead ollama[7273]: time=2024-06-16T16:47:38.007-07:00 level=WARN source=amd_linux.go:48 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sy>
Jun 16 16:47:38 dead ollama[7273]: time=2024-06-16T16:47:38.012-07:00 level=INFO source=amd_linux.go:304 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION="\"10.3.0\""
Jun 16 16:47:38 dead ollama[7273]: time=2024-06-16T16:47:38.214-07:00 level=WARN source=sched.go:511 msg="gpu VRAM usage didn't recover within timeout" seconds=5.545797232
```

But somehow, running this directly:

```
amnesia λ ~/ sudo ROCR_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION="10.3.0" OLLAMA_DEBUG=1 ollama serve
```

works fine and I can chat without issue. Here's my service file; please note I have tried both the ollama user and the root user (and the ollama user is properly configured and in the render & video groups):

```
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=root
Group=root
Restart=always
RestartSec=3
Environment="PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/local/lib/baresip/modules"
Environment="ROCR_VISIBLE_DEVICES=0"
Environment="HSA_OVERRIDE_GFX_VERSION=\"10.3.0\""
[Install]
WantedBy=default.target
```

Both in the shell and when run as a service, they report using the same GPU (id=0, 6700XT):

level=INFO source=amd_linux.go:71 msg="inference compute" id=0 library=rocm compute=gfx1031 driver=0.0 name=1002:73df total="12.0 GiB" available="12.0 GiB"
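
One detail in the journal above: the service sees HSA_OVERRIDE_GFX_VERSION="\"10.3.0\"" with literal escaped quotes, whereas the working shell invocation passes a plain 10.3.0. A minimal sketch of a systemd drop-in that passes the variables without the inner quotes (the drop-in path is the usual systemd convention; the values mirror the working shell command):

```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="ROCR_VISIBLE_DEVICES=0"
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
```

followed by `sudo systemctl daemon-reload && sudo systemctl restart ollama`, may be worth trying.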

@dhiltgen commented on GitHub (Jun 18, 2024):

@pulpocaminante the upcoming version 0.1.45 (rc2 is currently available) will report the GPU env vars in the log at startup, which should help you troubleshoot the settings and figure out which one isn't getting passed in as expected.
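
In the meantime, two generic ways to check what the service actually sees (plain systemd/journalctl commands; the exact log wording may differ between Ollama versions):

```
# Environment= values systemd passes to the unit
sudo systemctl show ollama --property=Environment

# Ollama logs its effective server config (including GPU env vars) at startup
journalctl -u ollama -b | grep "server config env"
```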

@Tai-Pham-2002 commented on GitHub (Aug 29, 2024):

When I run `ollama run aiden_lu/minicpm-v2.6:Q4_K_M`, I get this error:
Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed

This is my log:
Aug 29 17:35:21 ai-stg ollama[777926]: time=2024-08-29T17:35:21.555Z level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed"
Aug 29 17:35:21 ai-stg ollama[777926]: [GIN] 2024/08/29 - 17:35:21 | 500 | 1.011699134s | 127.0.0.1 | POST "/api/chat"
Aug 29 17:35:26 ai-stg ollama[777926]: time=2024-08-29T17:35:26.718Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.163553423 model=/usr/share/ollama/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1
Aug 29 17:35:27 ai-stg ollama[777926]: time=2024-08-29T17:35:27.072Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.51679798 model=/usr/share/ollama/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1
Aug 29 17:35:27 ai-stg ollama[777926]: time=2024-08-29T17:35:27.424Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.869100664 model=/usr/share/ollama/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1

Help me, please.

@Leon-Sander commented on GitHub (Sep 23, 2024):

@dhiltgen
I have no problems on Linux but get this error on Windows. My application uses Ollama as the LLM server, and many users work on Windows and experience this error. I also added "keep_alive": -1 to the API request, but it didn't change the result (see the request sketch after the log below). I am using the latest Docker image.

ollama-1 | 2024/09/23 10:28:22 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
ollama-1 | time=2024-09-23T10:28:22.189Z level=INFO source=images.go:753 msg="total blobs: 20"
ollama-1 | time=2024-09-23T10:28:22.307Z level=INFO source=images.go:760 msg="total unused blobs removed: 0"
ollama-1 | time=2024-09-23T10:28:22.411Z level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
ollama-1 | time=2024-09-23T10:28:22.416Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12]"
ollama-1 | time=2024-09-23T10:28:22.416Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
ollama-1 | time=2024-09-23T10:28:23.178Z level=INFO source=types.go:107 msg="inference compute" id=GPU-3c819917-39e0-af79-6f9b-db1d227a2872 library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3070" total="8.0 GiB" available="6.9 GiB"
ollama-1 | time=2024-09-23T10:28:56.963Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-3c819917-39e0-af79-6f9b-db1d227a2872 parallel=4 available=7444889600 required="6.2 GiB"
ollama-1 | time=2024-09-23T10:28:56.963Z level=INFO source=server.go:103 msg="system memory" total="7.7 GiB" free="6.4 GiB" free_swap="2.0 GiB"
ollama-1 | time=2024-09-23T10:28:56.965Z level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1 | time=2024-09-23T10:28:56.969Z level=INFO source=server.go:388 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 41745"
ollama-1 | time=2024-09-23T10:28:56.970Z level=INFO source=sched.go:449 msg="loaded runners" count=1
ollama-1 | time=2024-09-23T10:28:56.970Z level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
ollama-1 | time=2024-09-23T10:28:56.971Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
ollama-1 | INFO [main] build info | build=10 commit="eaf151c" tid="140164179275776" timestamp=1727087337
ollama-1 | INFO [main] system info | n_threads=6 n_threads_batch=6 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140164179275776" timestamp=1727087337 total_threads=12
ollama-1 | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="41745" tid="140164179275776" timestamp=1727087337
ollama-1 | time=2024-09-23T10:28:57.223Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
ollama-1 | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
ollama-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1 | llama_model_loader: - kv 0: general.architecture str = llama
ollama-1 | llama_model_loader: - kv 1: general.type str = model
ollama-1 | llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
ollama-1 | llama_model_loader: - kv 3: general.finetune str = Instruct
ollama-1 | llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
ollama-1 | llama_model_loader: - kv 5: general.size_label str = 8B
ollama-1 | llama_model_loader: - kv 6: general.license str = llama3.1
ollama-1 | llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
ollama-1 | llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
ollama-1 | llama_model_loader: - kv 9: llama.block_count u32 = 32
ollama-1 | llama_model_loader: - kv 10: llama.context_length u32 = 131072
ollama-1 | llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
ollama-1 | llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
ollama-1 | llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
ollama-1 | llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
ollama-1 | llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
ollama-1 | llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
ollama-1 | llama_model_loader: - kv 17: general.file_type u32 = 2
ollama-1 | llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
ollama-1 | llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
ollama-1 | llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
ollama-1 | llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
ollama-1 | llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
ollama-1 | llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1 | llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1 | llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
ollama-1 | llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
ollama-1 | llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
ollama-1 | llama_model_loader: - kv 28: general.quantization_version u32 = 2
ollama-1 | llama_model_loader: - type f32: 66 tensors
ollama-1 | llama_model_loader: - type q4_0: 225 tensors
ollama-1 | llama_model_loader: - type q6_K: 1 tensors
ollama-1 | llm_load_vocab: special tokens cache size = 256
ollama-1 | llm_load_vocab: token to piece cache size = 0.7999 MB
ollama-1 | llm_load_print_meta: format = GGUF V3 (latest)
ollama-1 | llm_load_print_meta: arch = llama
ollama-1 | llm_load_print_meta: vocab type = BPE
ollama-1 | llm_load_print_meta: n_vocab = 128256
ollama-1 | llm_load_print_meta: n_merges = 280147
ollama-1 | llm_load_print_meta: vocab_only = 0
ollama-1 | llm_load_print_meta: n_ctx_train = 131072
ollama-1 | llm_load_print_meta: n_embd = 4096
ollama-1 | llm_load_print_meta: n_layer = 32
ollama-1 | llm_load_print_meta: n_head = 32
ollama-1 | llm_load_print_meta: n_head_kv = 8
ollama-1 | llm_load_print_meta: n_rot = 128
ollama-1 | llm_load_print_meta: n_swa = 0
ollama-1 | llm_load_print_meta: n_embd_head_k = 128
ollama-1 | llm_load_print_meta: n_embd_head_v = 128
ollama-1 | llm_load_print_meta: n_gqa = 4
ollama-1 | llm_load_print_meta: n_embd_k_gqa = 1024
ollama-1 | llm_load_print_meta: n_embd_v_gqa = 1024
ollama-1 | llm_load_print_meta: f_norm_eps = 0.0e+00
ollama-1 | llm_load_print_meta: f_norm_rms_eps = 1.0e-05
ollama-1 | llm_load_print_meta: f_clamp_kqv = 0.0e+00
ollama-1 | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1 | llm_load_print_meta: f_logit_scale = 0.0e+00
ollama-1 | llm_load_print_meta: n_ff = 14336
ollama-1 | llm_load_print_meta: n_expert = 0
ollama-1 | llm_load_print_meta: n_expert_used = 0
ollama-1 | llm_load_print_meta: causal attn = 1
ollama-1 | llm_load_print_meta: pooling type = 0
ollama-1 | llm_load_print_meta: rope type = 0
ollama-1 | llm_load_print_meta: rope scaling = linear
ollama-1 | llm_load_print_meta: freq_base_train = 500000.0
ollama-1 | llm_load_print_meta: freq_scale_train = 1
ollama-1 | llm_load_print_meta: n_ctx_orig_yarn = 131072
ollama-1 | llm_load_print_meta: rope_finetuned = unknown
ollama-1 | llm_load_print_meta: ssm_d_conv = 0
ollama-1 | llm_load_print_meta: ssm_d_inner = 0
ollama-1 | llm_load_print_meta: ssm_d_state = 0
ollama-1 | llm_load_print_meta: ssm_dt_rank = 0
ollama-1 | llm_load_print_meta: ssm_dt_b_c_rms = 0
ollama-1 | llm_load_print_meta: model type = 8B
ollama-1 | llm_load_print_meta: model ftype = Q4_0
ollama-1 | llm_load_print_meta: model params = 8.03 B
ollama-1 | llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
ollama-1 | llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
ollama-1 | llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
ollama-1 | llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: LF token = 128 'Ä'
ollama-1 | llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: max token length = 256
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama-1 | ggml_cuda_init: found 1 CUDA devices:
ollama-1 | Device 0: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
ollama-1 | llm_load_tensors: ggml ctx size = 0.27 MiB
ollama-1 | time=2024-09-23T10:33:57.075Z level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.00 - "
ollama-1 | [GIN] 2024/09/23 - 10:33:57 | 500 | 5m0s | 172.18.0.3 | POST "/api/chat"
app-1 | {'error': 'timed out waiting for llama runner to start - progress 0.00 - '}
ollama-1 | time=2024-09-23T10:34:02.206Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.131559562 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | time=2024-09-23T10:34:02.455Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.380732173 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | time=2024-09-23T10:34:02.716Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.641832884 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
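
For completeness, a minimal sketch of the request shape being used (standard /api/chat endpoint; the model name, prompt, and the long --max-time are illustrative, the latter only to rule out the client disconnecting before loading finishes):

```
curl --max-time 600 http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "keep_alive": -1,
  "messages": [
    { "role": "user", "content": "hello" }
  ]
}'
```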

@dhiltgen commented on GitHub (Sep 24, 2024):

@Leon-Sander you can set OLLAMA_LOAD_TIMEOUT to adjust the timeout if your system needs more than 5m to load the model. If that doesn't help get you past it, please open a new issue with your server logs so we can investigate.
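
As a sketch (the 10m value is illustrative; OLLAMA_LOAD_TIMEOUT takes a Go-style duration, as seen in the server config log above), the timeout can be raised via the environment:

```
# Docker
docker run -d --gpus=all -e OLLAMA_LOAD_TIMEOUT=10m \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Native Linux install: add an override to the systemd unit
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_LOAD_TIMEOUT=10m"
```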

@Leon-Sander commented on GitHub (Sep 27, 2024):

@dhiltgen got it, thanks.

Do you have an idea why the loading/offloading to the GPU takes that much time on Windows? On Linux, llama3.1 loads in 10 seconds, but on Windows it takes me 5 minutes on the same computer. I have good hardware.
Inference time seems to be pretty much the same as on Linux; it's just the model loading that is unbearable.

Edit: The answer seems to be the WSL2-based Docker image: I/O back to the NTFS filesystem can be slow (#6006).
I just tested it without Docker and the loading is as fast as on Linux.
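
For anyone hitting the same slowdown, a minimal sketch of the mitigation (assuming Docker Desktop with the WSL2 backend; the bind-mount path is illustrative): keep the model store in a named Docker volume, which lives on the WSL2 ext4 disk, rather than bind-mounting an NTFS path.

```
# slow: model blobs are read back through the NTFS bind mount
docker run -d --gpus=all -v /mnt/c/ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# faster: model blobs stay inside a named volume on the WSL2 filesystem
docker volume create ollama
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```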

@JoffreyLemeryAncileo commented on GitHub (Nov 20, 2024):

@dhiltgen, thanks for all your activity!

For my company I am doing some testing to see which Azure instance type we are going to deploy our model on.
I installed Ollama 0.4.1 on Ubuntu 22.04 (NC64asT4v3 series).

I installed CUDA 12.6 and I have all the GPUs available.
Like many others, I updated my ollama.service like this to make it accessible via API calls:

```
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="OLLAMA_LOG_LEVEL=debug"
Environment="OLLAMA_GPU=1"

[Install]
WantedBy=default.target
```

When I run gemma2:27B, for example, it is quite slow despite the 4 GPUs.
I dug a bit and saw that just 4/47 layers are offloaded; CPU usage explodes and the GPUs are almost unused.

==> I duplicated the model with a Modelfile and control the offloading with PARAMETER num_gpu 47.
I also set it inside gemma2:27B using /set parameter num_gpu 47.

That improves performance a bit. I now want to validate it with API calls from my local laptop. However, no matter which model I call or which parameters I pass, it always ends up with the same small 4/47 offload.

I understand from another ticket you replied to that Ollama aims to calculate the VRAM itself. This is strange because, with the 4 GPUs, I should theoretically have enough VRAM to load and run all the layers.

So I wonder whether that calculation really works and whether there is a way to bypass it from a Python request payload (see the request sketch at the end of this comment).
One point I noticed: the logs at the beginning always tell me that the GPU VRAM didn't recover.

LOGS :
Nov 20 10:17:57 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:17:57.777Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=6.381064304 model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc Nov 20 10:17:59 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:17:59.348Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=7.952129341 model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc Nov 20 10:18:00 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:00.890Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=9.494299231 model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc Nov 20 10:18:02 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:02.431Z level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc library=cuda parallel=1 required="46.3 GiB" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.039Z level=INFO source=server.go:105 msg="system memory" total="432.9 GiB" free="427.0 GiB" free_swap="0 B" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.040Z level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=4 layers.model=47 layers.offload=4 layers.split=1,1,1,1 memory.available="[14.5 GiB 14.5 GiB 14.5 GiB 14.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="99.8 GiB" memory.required.partial="46.3 GiB" memory.required.kv="44.9 GiB" memory.required.allocations="[11.6 GiB 11.6 GiB 11.6 GiB 11.6 GiB]" memory.weights.total="58.6 GiB" memory.weights.repeating="57.7 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="8.6 GiB" memory.graph.partial="8.6 GiB" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.042Z level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3490198133/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 128000 --batch-size 512 --n-gpu-layers 4 --threads 64 --parallel 1 --tensor-split 1,1,1,1 --port 45325" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.043Z level=INFO source=sched.go:449 msg="loaded runners" count=1 Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.043Z level=INFO source=server.go:562 msg="waiting for llama runner to start responding" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.044Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.069Z level=INFO source=runner.go:863 msg="starting go runner"

[...]

Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: ggml_cuda_init: found 4 CUDA devices: Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 0: Tesla T4, compute capability 7.5, VMM: yes Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 1: Tesla T4, compute capability 7.5, VMM: yes Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 2: Tesla T4, compute capability 7.5, VMM: yes Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 3: Tesla T4, compute capability 7.5, VMM: yes Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: ggml ctx size = 1.14 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: offloading 4 repeating layers to GPU Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: offloaded 4/47 layers to GPU Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CPU buffer size = 14898.60 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA0 buffer size = 303.82 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA1 buffer size = 303.82 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA2 buffer size = 303.82 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA3 buffer size = 303.82 MiB Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: n_ctx = 128000 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: n_batch = 512 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: n_ubatch = 512 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: flash_attn = 0 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: freq_base = 10000.0 Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: freq_scale = 1 Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA_Host KV buffer size = 42000.00 MiB Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA0 KV buffer size = 1000.00 MiB Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA1 KV buffer size = 1000.00 MiB Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA2 KV buffer size = 1000.00 MiB Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA3 KV buffer size = 1000.00 MiB

Is my GPU config wrong, causing VRAM not to be released? Or is the 4/47 offload really the maximum? Or is the automatic num_gpu calculation wrong? Or maybe all of them?

Cheers and thanks!
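
On the "bypass it from the request payload" question: the generate/chat endpoints accept an options object, so the parameter can also be sent per request. A minimal sketch (model name and prompt are illustrative; whether 47 layers actually fit is still governed by the VRAM estimate shown in the log above):

```
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:27b",
  "prompt": "hello",
  "options": { "num_gpu": 47 }
}'
```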

@yanzhenxu99 commented on GitHub (Feb 10, 2025):

Same error inside Docker; log:
time=2025-02-10T09:53:08.738+08:00 level=INFO source=server.go:104 msg="system memory" total="1007.5 GiB" free="956.4 GiB" free_swap="0 B" time=2025-02-10T09:53:08.739+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=50 layers.model=62 layers.offload=37 layers.split=19,18 memory.available="[78.9 GiB 78.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="244.4 GiB" memory.required.partial="153.6 GiB" memory.required.kv="95.3 GiB" memory.required.allocations="[77.7 GiB 75.8 GiB]" memory.weights.total="224.7 GiB" memory.weights.repeating="224.0 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="7.1 GiB" memory.graph.partial="7.1 GiB" time=2025-02-10T09:53:08.741+08:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v11_avx/ollama_llama_server runner --model /nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6 --ctx-size 20480 --batch-size 512 --n-gpu-layers 50 --threads 127 --parallel 5 --tensor-split 19,18 --port 57150" time=2025-02-10T09:53:08.743+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2025-02-10T09:53:08.743+08:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding" time=2025-02-10T09:53:08.744+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" time=2025-02-10T09:53:08.758+08:00 level=INFO source=runner.go:936 msg="starting go runner" ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA A800-SXM4-80GB, compute capability 8.0, VMM: yes Device 1: NVIDIA A800-SXM4-80GB, compute capability 8.0, VMM: yes time=2025-02-10T09:53:08.792+08:00 level=INFO source=runner.go:937 msg=system info="CUDA : USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=127 time=2025-02-10T09:53:08.793+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:57150" time=2025-02-10T09:53:08.995+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" llama_load_model_from_file: using device CUDA0 (NVIDIA A800-SXM4-80GB) - 80727 MiB free llama_load_model_from_file: using device CUDA1 (NVIDIA A800-SXM4-80GB) - 80727 MiB free llama_model_loader: loaded meta data with 52 key-value pairs and 1025 tensors from /nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16 llama_model_loader: - kv 3: general.quantized_by str = Unsloth llama_model_loader: - kv 4: general.size_label str = 256x20B llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 6: deepseek2.block_count u32 = 61 llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840 llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168 llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432 llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128 llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128 llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8 llama_model_loader: - kv 15: deepseek2.leading_dense_block_count u32 = 3 llama_model_loader: - kv 16: deepseek2.vocab_size u32 = 129280 llama_model_loader: - kv 17: deepseek2.attention.q_lora_rank u32 = 1536 llama_model_loader: - kv 18: deepseek2.attention.kv_lora_rank u32 = 512 llama_model_loader: - kv 19: deepseek2.attention.key_length u32 = 192 llama_model_loader: - kv 20: deepseek2.attention.value_length u32 = 128 llama_model_loader: - kv 21: deepseek2.expert_feed_forward_length u32 = 2048 llama_model_loader: - kv 22: deepseek2.expert_count u32 = 256 llama_model_loader: - kv 23: deepseek2.expert_shared_count u32 = 1 llama_model_loader: - kv 24: deepseek2.expert_weights_scale f32 = 2.500000 llama_model_loader: - kv 25: deepseek2.expert_weights_norm bool = true llama_model_loader: - kv 26: deepseek2.expert_gating_func u32 = 2 llama_model_loader: - kv 27: deepseek2.rope.dimension_count u32 = 64 llama_model_loader: - kv 28: deepseek2.rope.scaling.type str = yarn llama_model_loader: - kv 29: deepseek2.rope.scaling.factor f32 = 40.000000 llama_model_loader: - kv 30: deepseek2.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 31: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000 llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 33: tokenizer.ggml.pre str = deepseek-v3 llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�... llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e... llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 0 llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 128815 llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 42: tokenizer.chat_template str = {% if not add_generation_prompt is de... 
llama_model_loader: - kv 43: general.quantization_version u32 = 2 llama_model_loader: - kv 44: general.file_type u32 = 24 llama_model_loader: - kv 45: quantize.imatrix.file str = DeepSeek-R1.imatrix llama_model_loader: - kv 46: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt llama_model_loader: - kv 47: quantize.imatrix.entries_count i32 = 720 llama_model_loader: - kv 48: quantize.imatrix.chunks_count i32 = 124 llama_model_loader: - kv 49: split.no u16 = 0 llama_model_loader: - kv 50: split.tensors.count i32 = 1025 llama_model_loader: - kv 51: split.count u16 = 0 llama_model_loader: - type f32: 361 tensors llama_model_loader: - type q4_K: 190 tensors llama_model_loader: - type q5_K: 116 tensors llama_model_loader: - type q6_K: 184 tensors llama_model_loader: - type iq2_xxs: 6 tensors llama_model_loader: - type iq1_s: 168 tensors llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect llm_load_vocab: special tokens cache size = 819 llm_load_vocab: token to piece cache size = 0.8223 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = deepseek2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 129280 llm_load_print_meta: n_merges = 127741 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 163840 llm_load_print_meta: n_embd = 7168 llm_load_print_meta: n_layer = 61 llm_load_print_meta: n_head = 128 llm_load_print_meta: n_head_kv = 128 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 192 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 24576 llm_load_print_meta: n_embd_v_gqa = 16384 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 18432 llm_load_print_meta: n_expert = 256 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = yarn llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 0.025 llm_load_print_meta: n_ctx_orig_yarn = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 671B llm_load_print_meta: model ftype = IQ1_S - 1.5625 bpw llm_load_print_meta: model params = 671.03 B llm_load_print_meta: model size = 130.60 GiB (1.67 BPW) llm_load_print_meta: general.name = DeepSeek R1 BF16 llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>' llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: EOT token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: PAD token = 128815 '<|PAD▁TOKEN|>' llm_load_print_meta: LF token = 131 'Ä' llm_load_print_meta: FIM PRE token = 128801 '<|fim▁begin|>' llm_load_print_meta: FIM SUF token = 128800 '<|fim▁hole|>' llm_load_print_meta: FIM MID token = 128802 '<|fim▁end|>' llm_load_print_meta: EOG token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_layer_dense_lead = 3 llm_load_print_meta: n_lora_q = 1536 llm_load_print_meta: n_lora_kv = 512 llm_load_print_meta: n_ff_exp = 2048 
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
time=2025-02-10T09:58:08.799+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.00 - "
[GIN] 2025/02/10 - 09:58:08 | 500 | 5m0s | 127.0.0.1 | POST "/api/generate"
time=2025-02-10T09:58:13.930+08:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.130783519 model=/nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6
time=2025-02-10T09:58:14.226+08:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.426999476 model=/nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6
time=2025-02-10T09:58:14.527+08:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.727853776 model=/nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6

nvidia-smi gpu usage:

<img width="479" alt="Image" src="https://github.com/user-attachments/assets/5950c48b-7619-414b-a4d5-9dc8ae46dece" />
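In the log above, the 671B IQ1_S checkpoint (~130 GiB, served from an NFS path) is killed after exactly 5m0s with "timed out waiting for llama runner to start - progress 0.00", and the "gpu VRAM usage didn't recover within timeout" warnings appear only after that abort, so they look like a side effect of the failed load rather than a separate VRAM problem. When a model is simply too large to load within five minutes, two knobs are worth checking: recent Ollama releases are reported to honour an OLLAMA_LOAD_TIMEOUT environment variable for the server-side limit (treat the name as an assumption and verify it against your release notes), and the HTTP client should use a timeout long enough that it does not cancel the request mid-load. A minimal client-side sketch (the model tag is a placeholder, not taken from this thread):

```python
# Minimal sketch: call /api/generate with a long client-side timeout so the
# request is not cancelled while a very large model is still loading.
# The model tag below is a placeholder; substitute one from `ollama list`.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

payload = {
    "model": "deepseek-r1:671b",  # placeholder tag
    "prompt": "Say hello.",
    "stream": False,              # return a single JSON object instead of a stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Allow up to an hour for the weights to stream in from slow (e.g. NFS) storage.
with urllib.request.urlopen(req, timeout=3600) as resp:
    print(json.loads(resp.read())["response"])
```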

@thistlillo commented on GitHub (Mar 27, 2025):

I get the message "gpu VRAM usage didn't recover within timeout" with almost all of the models available today. This one was generated while interacting with Qwen2.5:

gpu VRAM usage didn't recover within timeout


@marksverdhei commented on GitHub (Mar 31, 2025):

I'm getting this issue exclusively with Gemma-3-27b. QwQ 32B works, gemma-3-12b works, and I have 50 GB of VRAM. Pretty strange...


@radishlee commented on GitHub (Apr 3, 2025):

user@ubuntu:~$ systemctl status ollama
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2025-04-02 13:10:43 CST; 20h ago
Main PID: 1152 (ollama)
Tasks: 30 (limit: 74767)
Memory: 16.6G
CPU: 8h 41min 34.743s
CGroup: /system.slice/ollama.service
├─ 1152 /usr/local/bin/ollama serve
└─2256926 /usr/local/bin/ollama runner --model /data/ollama/.ollama/models/blobs/sha256-60cfdbde0472c3b850493551288a152f0858a0d1974964d6925c2b908035db76 --ctx-size 16384 --batch-size 512 --n-gpu-layers 31 --threads 8 --parallel 4 -->

4月 03 10:01:44 ubuntu ollama[1152]: llama_kv_cache_init: CPU KV buffer size = 7680.00 MiB
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: KV self size = 7680.00 MiB, K (f16): 3840.00 MiB, V (f16): 3840.00 MiB
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: CPU output buffer size = 1.62 MiB
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: CPU compute buffer size = 1088.01 MiB
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: graph nodes = 966
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: graph splits = 1
4月 03 10:01:44 ubuntu ollama[1152]: time=2025-04-03T10:01:44.733+08:00 level=INFO source=server.go:619 msg="llama runner started in 2.52 seconds"
4月 03 10:06:36 ubuntu ollama[1152]: time=2025-04-03T10:06:36.330+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.195915848 model=/data/ollama/.ollama/models/blobs/sha256-a9d3d622b517bcf8150341f0>
4月 03 10:06:36 ubuntu ollama[1152]: time=2025-04-03T10:06:36.536+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.40198618 model=/data/ollama/.ollama/models/blobs/sha256-a9d3d622b517bcf8150341f07>
4月 03 10:06:36 ubuntu ollama[1152]: time=2025-04-03T10:06:36.785+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.6509068540000005 model=/data/ollama/.ollama/models/blobs/sha256-a9d3d622b517bcf81>

user@ubuntu:~$ ollama --version
ollama version is 0.6.2

Ollama is running on a Jetson with CUDA 12.4.131, JetPack 6.2, cuDNN 1.0, and Ubuntu 22.04.
The LLM is deepseek-llm:7b and the embedding model is quentinz/bge-large-zh-v1.5:latest.
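In the log above the runner starts normally ("llama runner started in 2.52 seconds") and the VRAM-recovery warnings only show up about five minutes later, which suggests they are coming from the unload path after the idle timeout rather than from a failed load. To confirm what is actually resident on the GPU at that moment, the /api/ps endpoint behind `ollama ps` reports per-model sizes; a small sketch, with field names such as "size_vram" assumed from recent API versions and possibly different on older builds:

```python
# Small sketch: query Ollama's /api/ps (the endpoint behind `ollama ps`) and
# report how much of each loaded model is resident in VRAM. Field names such
# as "size_vram" are assumed from recent API versions.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps", timeout=10) as resp:
    data = json.loads(resp.read())

for model in data.get("models", []):
    size = model.get("size", 0)
    vram = model.get("size_vram", 0)
    share = 100 * vram / size if size else 0
    print(f"{model.get('name')}: {size / 2**30:.1f} GiB total, "
          f"{vram / 2**30:.1f} GiB in VRAM (~{share:.0f}% on GPU)")
```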


@talsan74 commented on GitHub (Apr 3, 2025):

I had the same issue while running llama3.2-vision:90b. I freed up more than 50GB of storage, and it worked fine


@yorjaggy commented on GitHub (Jul 22, 2025):

> I had the same issue while running llama3.2-vision:90b. I freed up more than 50GB of storage, and it worked fine

Actually, this saved my day. I double-checked the amount of free space with df -h, and after freeing up at least 200 GB (I'm running DeepSeek-V3-Q2_K_XS:latest) the model started working properly 🚀

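One plausible reason freeing disk space helps is that the GGUF blob is memory-mapped from the models directory at load time, so a nearly full (or still-materialising) volume may stall the load until the runner-start timeout fires. A quick pre-flight check before running a very large model; the default blob location and the 200 GiB headroom are assumptions taken from this thread, not hard requirements:

```python
# Quick pre-flight sketch: check free space on the volume holding the Ollama
# model blobs before loading a very large model. The default location
# (~/.ollama/models) and the 200 GiB headroom are assumptions, not requirements.
import shutil
from pathlib import Path

blob_dir = Path.home() / ".ollama" / "models"  # adjust if OLLAMA_MODELS points elsewhere
required_gib = 200                             # headroom that reportedly worked in this thread

target = blob_dir if blob_dir.exists() else Path.home()
free_gib = shutil.disk_usage(target).free / 2**30

print(f"free space on {target}: {free_gib:.1f} GiB")
if free_gib < required_gib:
    print(f"warning: less than {required_gib} GiB free; very large models may fail to load")
```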

@andtewfox commented on GitHub (Jul 22, 2025):

> What is the issue?
>
> I have already downloaded qwen:7b, but when i run ollama run qwen:7b,got this error Error: timed out waiting for llama runner to start:, in the server.log have this msg gpu VRAM usage didn't recover within timeout
>
> OS
>
> Windows
>
> GPU
>
> Nvidia
>
> CPU
>
> Intel
>
> Ollama version
>
> ollama version is 0.1.37

I think you need 16 GB of RAM for Ollama.


@pluberd commented on GitHub (Aug 4, 2025):

Same here. My version is 0.10.1 and I have 24 GB VRAM / 96 GB RAM.

While most LLMs can be loaded, some don't work:

jobautomation/OpenEuroLLM-German:latest
gemma3n:e4b-it-q4_K_M
gemma3n:e4b-it-q8_0


@Oruli commented on GitHub (Sep 22, 2025):

I get this same issue on ollama version 0.12.0; was it ever resolved?

It only happens after the model is unloaded when keep_alive is reached: I can have one chat, come back 10 minutes later, and Ollama loads the model to CPU instead of GPU.

level=INFO source=server.go:672 msg="loading model" "model layers"=25 requested=-1
cuda driver library failed to get device context 800
time=2025-09-22T13:24:00.748Z level=WARN source=gpu.go:453 msg="error looking up nvidia GPU memory"

msg="gpu VRAM usage didn't recover within timeout" seconds=5.519832576 runner.size="13.3 GiB" runner.vram="13.3 GiB" runner.parallel=1 runner.pid=300 runner.model=/root/.ollama/

ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
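The telling line here is "ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected" right after the keep_alive unload: the GPU (or its driver context, inside the container) seems to disappear between loads, and the VRAM-recovery warning is just a symptom. While debugging the driver or container side, one workaround is to keep the model resident so it is never unloaded, either by setting OLLAMA_KEEP_ALIVE on the server or by passing keep_alive with a request; a sketch, to be verified against your Ollama version:

```python
# Minimal sketch: send one warm-up request with keep_alive=-1 so the server
# keeps the model loaded indefinitely instead of unloading it after the
# default idle period. Setting OLLAMA_KEEP_ALIVE on the server is said to
# have the same effect; verify both against your Ollama version.
import json
import urllib.request

payload = {
    "model": "llama3.1:8b",  # placeholder tag; use the model you actually run
    "prompt": "warm-up",
    "stream": False,
    "keep_alive": -1,        # -1 = never unload; durations like "30m" or "24h" also work
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=600) as resp:
    print(json.loads(resp.read()).get("done"))
```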
Reference: github-starred/ollama#64802