[GH-ISSUE #4098] llama 70b takes 5.5 min to load on A100 #2546

Closed
opened 2026-04-12 12:52:06 -05:00 by GiteaMirror · 27 comments
Owner

Originally created by @rohidas-delcu on GitHub (May 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4098

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I've installed the model in the Ollama Docker pod successfully. However, when attempting to execute a query, there seems to be an issue. I've tried running "ollama run llama3:instruct", but the spinner just keeps spinning.

Here's a breakdown of the steps I've taken:

  • Executed the command to install the llama3 model:
ollama run llama3:instruct
  • After the installation completed, I immediately tried asking a question, but received no response. I waited for a considerable amount of time, but nothing changed.
  • I also attempted to run a curl command inside the pod, but encountered the same issue. The command seemed to get stuck, with no response.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What color is the sky at different times of the day?"
}'
  • I even checked the logs of the pod, but unfortunately, I didn't come across any helpful information.

![image](https://github.com/ollama/ollama/assets/112613598/37be3d25-7482-43c2-82a3-192ec7c5d500)

![image](https://github.com/ollama/ollama/assets/112613598/aa85c458-b724-4707-a9f6-964bf53e5c9f)

![image](https://github.com/ollama/ollama/assets/112613598/040a84ff-e18b-4166-b517-b119b68f3080)

Note: The server appeared to be up and listening on port 11434.
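A couple of quick ways to double-check that and to capture the server log while reproducing the hang (a minimal sketch; the container name "ollama" is a placeholder for however the pod/container is actually named):

# The root endpoint replies "Ollama is running" when the server is up:
curl http://localhost:11434/

# Follow the server log while reproducing the hang; model-load progress is printed here:
docker logs -f ollama        # or: kubectl logs -f <pod> for a Kubernetes pod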

OS

Linux

GPU

No response

CPU

No response

Ollama version

llama3:instruct

GiteaMirror added the docker, bug, nvidia labels 2026-04-12 12:52:06 -05:00
Author
Owner

@dhiltgen commented on GitHub (May 2, 2024):

Can you share your server log? Do you have a GPU, or is this running in CPU mode?

Author
Owner

@rohidas-delcu commented on GitHub (May 3, 2024):

@dhiltgen,

Logs:
![image](https://github.com/ollama/ollama/assets/112613598/9a4767a7-1490-4b09-a111-d642836e7c34)

Pod Metrics:
![image](https://github.com/ollama/ollama/assets/112613598/d2ae8992-8833-491a-96db-67208895c6a5)

Author
Owner

@MarkoSagadin commented on GitHub (May 3, 2024):

Hello,

I have been facing the exact same issue since today. For me it is present in the Ollama Docker images, both v0.1.32 and v0.1.33. I am running the images inside VastAI instances.

I can successfully pull the model, but when I try to run it, it hangs. It doesn't matter whether I go through the CLI or the HTTP API; the result is the same.

However, I noticed some differences.

For example, the above never happens on Vast instances with an RTX 4090 GPU; there it works as expected.
But it hangs on more powerful ones, for example an A100 or H100. Canceling and repeating the run command doesn't change anything.

I have recorded the logs from the Ollama server when I run the ollama run command.

Good response on RTX 4090 with gemma:7b-instruct-v1.1-q4_0
root@C.10747901:~$ ollama run gemma:7b-instruct-v1.1-q4_0
time=2024-05-03T19:50:19.301Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-05-03T19:50:19.303Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama3318894424/runners/cuda_v11/libcudart.so.11.0 count=2
time=2024-05-03T19:50:19.303Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T19:50:21.101Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=29 memory.available="23823.2 MiB" memory.required.full="6408.9 MiB" memory.required.partial="6408.9 MiB" memory.required.kv="672.0 MiB" memory.weights.total="4773.9 MiB" memory.weights.repeating="4158.7 MiB" memory.weights.nonrepeating="615.2 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
time=2024-05-03T19:50:21.102Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=29 memory.available="23823.2 MiB" memory.required.full="6408.9 MiB" memory.required.partial="6408.9 MiB" memory.required.kv="672.0 MiB" memory.weights.total="4773.9 MiB" memory.weights.repeating="4158.7 MiB" memory.weights.nonrepeating="615.2 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
time=2024-05-03T19:50:21.102Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T19:50:21.102Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama3318894424/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --parallel 1 --port 40769"
time=2024-05-03T19:50:21.103Z level=INFO source=sched.go:340 msg="loaded runners" count=1
time=2024-05-03T19:50:21.103Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140553671749632","timestamp":1714765821}
{"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140553671749632","timestamp":1714765821}
{"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":127,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA =
 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140553671749632","timestamp":1714765821,"total_thr
eads":255}
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from /root/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-1.1-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   4:                          gemma.block_count u32              = 28
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   57 tensors
llama_model_loader: - type q4_0:  196 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 192
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.54 B
llm_load_print_meta: model size       = 4.66 GiB (4.69 BPW)
llm_load_print_meta: general.name     = gemma-1.1-7b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   615.23 MiB
llm_load_tensors:      CUDA0 buffer size =  4773.90 MiB
.
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   896.00 MiB
llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   506.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 931
llama_new_context_with_model: graph splits = 2
_ {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140553671749632","timestamp":1714765823}
{"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"140553671749632","timestamp":1714765823}
{"function":"main","level":"INFO","line":3067,"msg":"model loaded","tid":"140553671749632","timestamp":1714765823}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3270,"msg":"HTTP server listening","n_threads_http":"254","port":"40769","tid":"140553671749632","timestamp":1714765823}
{"function":"update_slots","level":"INFO","line":1581,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"140553671749632","timestamp":1714765823}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":0,"tid":"140553671749632","timestamp":1714765823}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":36898,"status":200,"tid":"140550372052992","time
stamp":1714765823}

I have cut off the end of the above log; it successfully ends with a ready-to-use prompt line.

On the other hand, the two runs below hang indefinitely:

Bad run on H100 with gemma:7b-instruct-v1.1-q4_0
root@C.10747901:~$ ollama run gemma:7b-instruct-v1.1-q4_0
[GIN] 2024/05/03 - 20:02:19 | 200 |      25.739µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/05/03 - 20:02:19 | 200 |     519.066µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/05/03 - 20:02:19 | 200 |     821.282µs |       127.0.0.1 | POST     "/api/show"
time=2024-05-03T20:02:19.037Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-05-03T20:02:19.043Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1224074478/runners/cuda_v11/libcudart.so.11.0 count=2
time=2024-05-03T20:02:19.043Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T20:02:20.616Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=29 memory.available="80482.9 MiB" memory.required.full="6408.9 MiB" memory.required.partial="6408.9 MiB" memory.required.kv="672.0 MiB" memory.weights.total="4773.9 MiB" memory.weights.repeating="4158.7 MiB" memory.weights.nonrepeating="615.2 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
time=2024-05-03T20:02:20.617Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=29 memory.available="80482.9 MiB" memory.required.full="6408.9 MiB" memory.required.partial="6408.9 MiB" memory.required.kv="672.0 MiB" memory.weights.total="4773.9 MiB" memory.weights.repeating="4158.7 MiB" memory.weights.nonrepeating="615.2 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
time=2024-05-03T20:02:20.617Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T20:02:20.617Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama1224074478/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --parallel 1 --port 46235"
time=2024-05-03T20:02:20.618Z level=INFO source=sched.go:340 msg="loaded runners" count=1
time=2024-05-03T20:02:20.618Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140132691918848","timestamp":1714766540}
{"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140132691918848","timestamp":1714766540}
{"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":112,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA =
 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140132691918848","timestamp":1714766540,"total_thr
eads":224}
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from /root/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-1.1-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   4:                          gemma.block_count u32              = 28
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   57 tensors
llama_model_loader: - type q4_0:  196 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 192
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.54 B
llm_load_print_meta: model size       = 4.66 GiB (4.69 BPW)
llm_load_print_meta: general.name     = gemma-1.1-7b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Bad run on H100 with llama2:70b-chat-q4_0
root@C.10747901:~$ ollama run llama2:70b-chat-q4_0
[GIN] 2024/05/03 - 20:11:07 | 200 |      25.188µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/05/03 - 20:11:07 | 200 |     478.553µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/05/03 - 20:11:07 | 200 |     262.277µs |       127.0.0.1 | POST     "/api/show"
time=2024-05-03T20:11:07.545Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-05-03T20:11:07.553Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1224074478/runners/cuda_v11/libcudart.so.11.0 count=2
time=2024-05-03T20:11:07.553Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T20:11:08.354Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=81 memory.available="80482.9 MiB" memory.required.full="38351.1 MiB" memory.required.partial="38351.1 MiB" memory.required.kv="640.0 MiB" memory.weights.total="36930.1 MiB" memory.weights.repeating="36725.0 MiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="348.0 MiB"
time=2024-05-03T20:11:08.356Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=81 memory.available="80482.9 MiB" memory.required.full="38351.1 MiB" memory.required.partial="38351.1 MiB" memory.required.kv="640.0 MiB" memory.weights.total="36930.1 MiB" memory.weights.repeating="36725.0 MiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="348.0 MiB"
time=2024-05-03T20:11:08.356Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T20:11:08.357Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama1224074478/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-68bbe6dc9cf42eb60c9a7f96137fb8d472f752de6ebf53e9942f267f1a1e2577 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --parallel 1 --port 44497"
time=2024-05-03T20:11:08.357Z level=INFO source=sched.go:340 msg="loaded runners" count=1
time=2024-05-03T20:11:08.357Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140574146117632","timestamp":1714767068}
{"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140574146117632","timestamp":1714767068}
{"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":112,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA =
 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140574146117632","timestamp":1714767068,"total_thr
eads":224}
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from /root/.ollama/models/blobs/sha256-68bbe6dc9cf42eb60c9a7f96137fb8d472f752de6ebf53e9942f267f1a1e2577 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,61249]   = ["_ t", "e r", "i n", "_ a", "e n...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  561 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 36.20 GiB (4.51 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.74 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:        CPU buffer size =   140.62 MiB
llm_load_tensors:      CUDA0 buffer size = 36930.11 MiB
.
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   640.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.15 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   324.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    20.01 MiB
llama_new_context_with_model: graph nodes  = 2566
llama_new_context_with_model: graph splits = 2
Author
Owner

@dhiltgen commented on GitHub (May 4, 2024):

@MarkoSagadin when you say "hang indefinitely" can you clarify how long we're talking? We've seen some cloud instances have quite slow I/O and can take a very long time to load models. Some users are reporting our default timeout of 10m isn't sufficient to load models on some setups. Does it eventually hit a model load timeout error at 10m?
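
One way to tell a genuinely hung load apart from a very slow one is to watch the server log and GPU memory while the request is pending. A minimal sketch for a Docker install like the one in this report (the container name `ollama` is an assumption):

```
# Follow the server log; a healthy load keeps printing
# "model load progress" lines until the runner comes online.
docker logs -f ollama

# In another terminal, watch GPU memory fill as layers are offloaded.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5
```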

Author
Owner

@MarkoSagadin commented on GitHub (May 5, 2024):

@dhiltgen it hung for at least 3 minutes; that is the timeout I had set for the httpx requests inside my program.

I can't say how much longer it would have hung (and it will be a while before I can test this, since I will be out of the office next week).

What I find odd is that running models works fine on an RTX 4090 GPU but not on an H100 GPU, and the H100 actually worked fine two days ago.
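
Since a client-side timeout can make a slow first load look like a hang, it may help to retest with a generous request timeout. A hedged sketch using curl (the 900-second value is an arbitrary choice, not an Ollama recommendation):

```
# Give the first (cold) request plenty of time before the client gives up.
curl --max-time 900 http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Say hello."
}'
```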

Author
Owner

@StrikerRUS commented on GitHub (May 14, 2024):

Sorry for the off-topic note. @rk-spirinova, you'd better hide your public, unprotected IP in your screenshot.

Author
Owner

@MarkoSagadin commented on GitHub (May 15, 2024):

@dhiltgen I have made some new discoveries on this issue.

  • The issue is present on the H100 with Docker versions 0.1.38, 0.1.37, 0.1.30 and 0.1.28 (I was curious how far back I would need to go for this to work).
  • The issue is not present on RTX 4090 with above versions.

I let the /api/generate call (using the llama3 model) hang, to see whether I would hit the 10 min timeout you mentioned. To my surprise, the model actually executed my request after 5 minutes and 30 seconds. Subsequent runs with the same model or a different one (phi3) were instantaneous.

This is a workaround I can work with, as long as the long delays don't come back when switching between models; I am evaluating many of them...
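
If the cost is only paid on the first load, one way to avoid paying it repeatedly while evaluating models is to keep them resident with the API's keep_alive parameter, assuming there is enough VRAM for the models being switched between. A rough sketch:

```
# Load the model once and ask the server to keep it in memory indefinitely.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "keep_alive": -1
}'

# Confirm it stays loaded.
ollama ps
```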

Is there any extra information I can provide, such as instance details, NVIDIA driver versions, Ollama logs, etc., to help solve this issue?

Author
Owner

@dhiltgen commented on GitHub (May 16, 2024):

@MarkoSagadin that's great to hear it did actually load and wasn't hung.

Now that we know it's not stuck and is more of a performance problem, the question is whether there's a bug in the Ollama code making this unnecessarily inefficient, or whether your cloud instance needs tuning. Check that the storage device holding the models is high-performance storage, and try different instance types (CPUs, memory); you may be on an instance type that is getting "starved" for some resource at the hypervisor level, resulting in slow I/O transfers into the GPU.

Is there any extra information that I can provide? Such as instance information, nvidia driver versions, Ollama logs, etc. to help solving this issue?

You might want to try to explore some metrics on your node to see if there's an obvious bottleneck leading to this slow performance. For example iostat -dmx 5

If none of that helps you find a good combination, can you share what cloud provider you're using, and the VM configuration details so we can try to repro?
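
As a concrete way to check whether the load is I/O bound, one option is to measure how fast the model blob itself can be read back and compare that with the observed load time. A rough sketch, assuming the default model path from the logs above (the blob name is simply the one that appears there):

```
# Watch per-device utilisation while the model loads.
iostat -dmx 5

# Measure raw read throughput of the model blob, bypassing the page cache.
dd if=/root/.ollama/models/blobs/sha256-68bbe6dc9cf42eb60c9a7f96137fb8d472f752de6ebf53e9942f267f1a1e2577 \
   of=/dev/null bs=1M iflag=direct status=progress

# Back-of-the-envelope: a ~36 GiB q4_0 70B blob read at ~110 MiB/s takes
# roughly 5.5 minutes, which matches the load times reported in this thread.
echo "scale=1; 36*1024/110/60" | bc
```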

Author
Owner

@pdevine commented on GitHub (May 18, 2024):

I just tried this with a 2xA100 on Ubuntu 22.04 and everything is working correctly:

$ ollama ps
NAME           	ID          	SIZE  	PROCESSOR	UNTIL
llama3:instruct	a6990ed6be41	5.4 GB	100% GPU 	4 minutes from now

and the timing output from --verbose:

total duration:       6.791328594s
load duration:        4.626741228s
prompt eval count:    189 token(s)
prompt eval duration: 180.487ms
prompt eval rate:     1047.17 tokens/s
eval count:           177 token(s)
eval duration:        1.839261s
eval rate:            96.23 tokens/s

This is w/ ollama 0.1.38.

Results for gemma:7b-instruct-v1.1-q4_0:

total duration:       1.004777727s
load duration:        2.342302ms
prompt eval count:    18 token(s)
prompt eval duration: 130.858ms
prompt eval rate:     137.55 tokens/s
eval count:           74 token(s)
eval duration:        739.935ms
eval rate:            100.01 tokens/s

And for llama2:70b-chat-q4_0:

total duration:       25.950922105s
load duration:        2.576583ms
prompt eval count:    29 token(s)
prompt eval duration: 587.475ms
prompt eval rate:     49.36 tokens/s
eval count:           612 token(s)
eval duration:        25.229529s
eval rate:            24.26 tokens/s
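
For reference, timing output like the above comes from running the model with the --verbose flag, e.g.:

```
# Prints total/load/prompt-eval/eval durations after each response.
ollama run llama3:instruct "Why is the sky blue?" --verbose
```
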
Author
Owner

@pdevine commented on GitHub (May 19, 2024):

@rk-spirinova can you update to 0.1.38 and try again? Also, what version of Linux are you running?

Author
Owner

@hekmon commented on GitHub (May 24, 2024):

I had the same issue:

  • working great on a 3090
  • big performance issue on an H100
    • 5 min 30 s for me too to load a model (even a small embeddings model)
    • inference was painfully slow (between 1s and 8s)

I decided to try a local build, following https://github.com/ollama/ollama/issues/4131#issuecomment-2097813973:

  • updated the CUDA toolkit (https://developer.nvidia.com/cuda-downloads) from 12.4 to 12.5 (driver from 550 to 555, see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#switching-between-driver-module-flavors)
  • checked out the v0.1.39 prerelease version (https://github.com/ollama/ollama/releases/tag/v0.1.39)
  • built Ollama with the machine's own updated CUDA (https://github.com/ollama/ollama/blob/main/docs/development.md#linux)
  • started Ollama with the new OLLAMA_FLASH_ATTENTION=1 (this may have nothing to do with the fix, but since I set it, I report it anyway)

Everything is running super smoothly now. My guess is that there is some issue with the A100/H100 cards and the CUDA version used by Ollama for the official builds (binary and Docker).
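
For anyone wanting to reproduce this kind of local build, a rough sketch of the steps referenced above (see docs/development.md in the repository for the authoritative instructions; this assumes a working Go toolchain and the locally installed CUDA toolkit):

```
git clone https://github.com/ollama/ollama.git
cd ollama
git checkout v0.1.39
# Generate the llama.cpp runners against the local CUDA toolkit, then build.
go generate ./...
go build .
# Run the freshly built binary with flash attention enabled.
OLLAMA_FLASH_ATTENTION=1 ./ollama serve
```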

Author
Owner

@pdevine commented on GitHub (May 24, 2024):

@hekmon what version of Linux are you using and are you using this in the cloud somewhere? I just want to see if I can duplicate the issue.

Author
Owner

@hekmon commented on GitHub (May 24, 2024):

This is a baremetal server, unfortunately, not a cloud instance. But I am confident that the issue lies with the A100/H100 and the CUDA version used to compile Ollama, therefore any Ubuntu 22.04 machine with an A100/H100 should be able to replicate the issue.

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

Edit: just saw your previous comment where you said you tested it on Ubuntu 22.04 with 2xA100. That's a bummer. Maybe it's on the model side then? The 2 models I was using (with the experimental env vars for max loaded models set to 4 and request concurrency set to 10) were:

  • mxbai for embeddings
  • deepseekcoder 6.7b (tried both quantized and fp16)
Author
Owner

@pdevine commented on GitHub (May 24, 2024):

Edit: just saw your previous comment where you said you tested it on Ubuntu 22.04 with 2xA100. That's a bummer. Maybe it's on the model side then? The 2 models I was using (with the experimental env vars for max loaded models set to 4 and request concurrency set to 10) were:

  • mxbai for embeddings
  • deepseekcoder 6.7b (tried both quantized and fp16)

It could still be a driver issue. I just used whatever drivers were available in the "ML-in-a-box" Ubuntu 22.04 image on Paperspace; I didn't try from scratch. I'll give that a shot soon.

Author
Owner

@pdevine commented on GitHub (May 27, 2024):

OK, this turns out to be an NVIDIA problem: they updated the driver and it no longer loads the correct kernel modules.

#4652 fixes this, but the workaround is to run:

sudo modprobe nvidia
sudo modprobe nvidia_uvm

I'll go ahead and close out the issue.
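
To verify the workaround took effect, and to keep the modules loaded across reboots, something like the following may help (the modules-load.d mechanism is standard systemd behaviour, not Ollama-specific):

```
# Check whether the NVIDIA kernel modules are actually loaded.
lsmod | grep nvidia

# Load them for the current boot if they are missing.
sudo modprobe nvidia
sudo modprobe nvidia_uvm

# Ask systemd to load them automatically on every boot.
printf 'nvidia\nnvidia_uvm\n' | sudo tee /etc/modules-load.d/nvidia.conf
```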

Author
Owner

@pdevine commented on GitHub (May 27, 2024):

Note, this is specifically w/ driver version 555. @hekmon if you do an ollama ps you can see that everything is loaded onto the CPU instead of the GPU, so it's not really the same problem that was initially reported (which I think was fixed before).

Author
Owner

@hekmon commented on GitHub (May 27, 2024):

That's odd.

I used to have driver 550 (updated my original post) and had the issue. I now have 555, and my locally compiled Ollama correctly reports GPU usage with ollama ps.

I will try to launch the original Ollama 0.1.38 Docker container, which I might still have unmodified, to double-check that it had the issue even with 555 while the locally compiled Ollama does not, and will report back ASAP.

Author
Owner

@hekmon commented on GitHub (May 28, 2024):

So I restarted my original Ollama v0.1.38 container with the new CUDA/driver version. I no longer see the 5 min 30 s load time, but it still does not work properly (symptoms are now closer to https://github.com/ollama/ollama/issues/4131):

  • Starting is ok
  • Then I try to load the mxbai embeddings model with the following curl: curl "http://127.0.0.1:11434/api/embeddings" -d '{"model": "mxbai-embed-large", "keep_alive": -1}'
  • The curl then blocks for 1 min 52 s and returns {"error":"timed out waiting for llama runner to start - progress 1.00 - "}
  • During that time, an ollama ps clearly shows that the GPU is in use (not the CPU):
NAME                            ID              SIZE    PROCESSOR       UNTIL
mxbai-embed-large:latest        468836162de7    1.3 GB  100% GPU        Forever
  • After the error the model does not appear in ollama ps any more

ollama_v0.1.38.log: https://github.com/ollama/ollama/files/15467477/ollama_v0.1.38.log

TL;DR
After the CUDA/driver upgrade, the symptoms changed (a 5 min 30 s load time with a usable model at the end -> a model load timeout after ~2 min), but Ollama still does not work as expected, while a locally compiled Ollama does. In my opinion this has nothing to do with https://github.com/ollama/ollama/pull/4652 (my NVIDIA kernel modules are loaded).

Author
Owner

@dhiltgen commented on GitHub (May 28, 2024):

I've made some recent changes to try to tighten up our timeouts on model loading but it looks like that new 1m timer might be too aggressive in some cases...

May 28 09:37:07 llmlab-01 docker[1758301]: time=2024-05-28T09:37:07.459Z level=DEBUG source=server.go:573 msg="model load progress 1.00"
May 28 09:38:07 llmlab-01 docker[1758301]: time=2024-05-28T09:38:07.494Z level=ERROR source=sched.go:344 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 1.00 - "

The underlying runner claims 100% loaded, but still doesn't come online a minute later.

Author
Owner

@alonilon commented on GitHub (May 28, 2024):

Wow, this thread is spot on!
Just had this issue today running Ollama on OpenShift with a GPU.

I’ll update when I get to test this tomorrow morning!

Author
Owner

@alonilon commented on GitHub (May 29, 2024):

Unfortunately after further testing today this does not seem to resolve the issue in my case

No further logs or information is given after the 5m timeout

Still waiting for the llama runner to start :(

Author
Owner

@dhiltgen commented on GitHub (May 29, 2024):

@alonilon can you set OLLAMA_DEBUG=1 and share the logs leading up to the timeout? I'm curious if it was stalled at 0%, 100% or some other behavior.
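
For a plain Docker install, the debug flag can be passed as an environment variable when starting the container; a minimal sketch based on the standard Ollama Docker invocation (adapt to the OpenShift deployment, e.g. via the pod's environment, as needed):

```
docker run -d --gpus=all \
  -e OLLAMA_DEBUG=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
docker logs -f ollama
```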

Author
Owner

@alonilon commented on GitHub (May 30, 2024):

@dhiltgen
Unfortunately I cannot share the logs as the machine is disconnected from the internet

I can see the logs of the model being loaded all the way up to 1.00

Are there any other debug options I could try?

Is there any way to debug the status of the runner itself?
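
One hedged way to inspect the runner directly, based on the runner command line visible in the server logs earlier in this thread, is to find the port passed via --port and poll it while the load is in progress; whether the runner answers on /health depends on the build, so treat this as an assumption:

```
# Find the spawned runner process and the port it was given.
ps aux | grep ollama_llama_server

# Poll the runner's health endpoint (replace 44497 with the port from above).
curl http://127.0.0.1:44497/health
```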

Author
Owner

@dmikushin commented on GitHub (Feb 23, 2025):

With the latest Ollama I can confirm that the upload time is extremely slow on an A100 (please see below). The optimal CUDA host-to-device bandwidth should be 25 GB/sec, according to bandwidthTest on the same machine. But maybe this is not an A100 problem. Another oddity is the disk, which in my case is fuse-overlayfs on top of an XFS filesystem. I wouldn't be surprised if this disk setup is not as well tested as native ext4.
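
A quick way to test the fuse-overlayfs theory, under the assumption that the slowness is in the filesystem path rather than the GPU, is to compare blob read throughput through the overlay with a copy on a native filesystem, or to bind-mount a host directory for the models so reads bypass the overlay entirely. A sketch (the blob name and /data/ollama host path are placeholders):

```
# Read a model blob through the current (overlay) path and note the throughput.
dd if=/root/.ollama/models/blobs/<some-blob> of=/dev/null bs=1M status=progress

# Bind-mount a host directory for models so reads bypass fuse-overlayfs.
docker run -d --gpus=all \
  -v /data/ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama
```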

llm_load_tensors: offloading 80 repeating layers to GPU                                                                                                      
llm_load_tensors: offloading output layer to GPU                                                                                                             
llm_load_tensors: offloaded 81/81 layers to GPU                                                                                                              
llm_load_tensors:    CUDA_Host model buffer size =  1064.62 MiB                                                                                              
llm_load_tensors:        CUDA0 model buffer size = 70429.66 MiB                                                                                              
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads                                                       
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0                                                                        
time=2025-02-24T00:22:14.131+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.01"                                                           
time=2025-02-24T00:22:20.923+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.03"                                                           
time=2025-02-24T00:22:21.174+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.03"                                                           
time=2025-02-24T00:22:21.425+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.03"                                                           
time=2025-02-24T00:22:21.676+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.03"                                                           
time=2025-02-24T00:22:22.682+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:23.688+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:24.694+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:25.196+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:25.448+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:26.453+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.05"                                                           
time=2025-02-24T00:22:27.459+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.05"                                                           
time=2025-02-24T00:22:28.465+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.05"                                                           
time=2025-02-24T00:22:28.967+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.06"                                                           
time=2025-02-24T00:22:29.219+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.06"
time=2025-02-24T00:22:30.226+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.06"
time=2025-02-24T00:22:31.232+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.06"
time=2025-02-24T00:22:32.489+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:32.740+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:32.991+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:33.996+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:35.252+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:36.257+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:36.508+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:36.759+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:37.010+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:38.016+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:39.021+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.09"
time=2025-02-24T00:22:40.026+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.09"
time=2025-02-24T00:22:40.277+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.09"
time=2025-02-24T00:22:40.779+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.09"
time=2025-02-24T00:22:41.784+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.10"
Author
Owner

@dmikushin commented on GitHub (Feb 23, 2025):

Finally, the same 5.5 min for me:

[GIN] 2025/02/24 - 00:43:53 | 200 |         5m57s |       127.0.0.1 | POST     "/api/generate"
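
A quick back-of-the-envelope using the CUDA0 buffer size from the log above: roughly 70430 MiB loaded over ~357 s works out to about 200 MiB/s of effective throughput, far below both NVMe and PCIe capability, which again points at the storage path rather than the GPU:

```
echo "70430/357" | bc   # ≈ 197 MiB/s effective load throughput
```
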
Author
Owner

@pdevine commented on GitHub (Feb 24, 2025):

@dmikushin how is your fs set up? Just wondering if this is IO bound to the disk.

Author
Owner

@dmikushin commented on GitHub (Feb 24, 2025):

@pdevine Hard to tell more ATM. We haven't noticed disk I/O issues in other apps. I guess I need to dig into the Ollama internals. But overall this issue is valid and should not be closed.

Reference: github-starred/ollama#2546