[GH-ISSUE #4098] llama 70b takes 5.5 min to load on A100 #2546

Closed
opened 2026-04-12 12:52:06 -05:00 by GiteaMirror · 27 comments
Owner

Originally created by @rohidas-delcu on GitHub (May 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4098

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I've installed the model in the Ollama Docker pod successfully. However, when attempting to execute a query, there seems to be an issue. I've tried running "ollama run llama3:instruct", but the spinner just keeps spinning.

Here's a breakdown of the steps I've taken:

  • Executed the command to install the llama3 model:
ollama run llama3:instruct
  • After the installation completed, I immediately tried asking a question, but received no response. I waited for a considerable amount of time, but nothing changed.
  • I also attempted to run a curl command inside the pod, but encountered the same issue. The command seemed to get stuck, with no response.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What color is the sky at different times of the day?"
}'
  • I even checked the logs of the pod, but unfortunately, I didn't come across any helpful information.

![image](https://github.com/ollama/ollama/assets/112613598/37be3d25-7482-43c2-82a3-192ec7c5d500)

![image](https://github.com/ollama/ollama/assets/112613598/aa85c458-b724-4707-a9f6-964bf53e5c9f)

![image](https://github.com/ollama/ollama/assets/112613598/040a84ff-e18b-4166-b517-b119b68f3080)

Note: The server appeared to be up and listening on port 11434.
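A couple of quick ways to double-check that and to capture the server log while reproducing the hang (a minimal sketch; the container name "ollama" is a placeholder for however the pod/container is actually named):

# The root endpoint replies "Ollama is running" when the server is up:
curl http://localhost:11434/

# Follow the server log while reproducing the hang; model-load progress is printed here:
docker logs -f ollama        # or: kubectl logs -f <pod> for a Kubernetes pod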

OS

Linux

GPU

No response

CPU

No response

Ollama version

llama3:instruct

GiteaMirror added the docker, bug, nvidia labels 2026-04-12 12:52:06 -05:00
Author
Owner

@dhiltgen commented on GitHub (May 2, 2024):

Can you share your server log? Do you have a GPU, or is this running in CPU mode?

Author
Owner

@rohidas-delcu commented on GitHub (May 3, 2024):

@dhiltgen,

Logs:
![image](https://github.com/ollama/ollama/assets/112613598/9a4767a7-1490-4b09-a111-d642836e7c34)

Pod Metrics:
![image](https://github.com/ollama/ollama/assets/112613598/d2ae8992-8833-491a-96db-67208895c6a5)

Author
Owner

@MarkoSagadin commented on GitHub (May 3, 2024):

Hello,

I have been facing the exact same issue since today. For me it is present in the Ollama Docker images, both v0.1.32 and v0.1.33. I am running the images inside VastAI instances.

I can successfully pull the model, but when I try to run it, it hangs. It doesn't matter whether I go through the CLI or the HTTP API; the result is the same.

However, I noticed some differences.

For example, the above never happens on Vast instances with an RTX 4090 GPU; there it works as expected.
But it hangs on more powerful ones, for example an A100 or H100. Canceling and repeating the run command doesn't change anything.

I have recorded the logs from the Ollama server when I run the ollama run command.

Good response on RTX 4090 with gemma:7b-instruct-v1.1-q4_0
root@C.10747901:~$ ollama run gemma:7b-instruct-v1.1-q4_0
time=2024-05-03T19:50:19.301Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-05-03T19:50:19.303Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama3318894424/runners/cuda_v11/libcudart.so.11.0 count=2
time=2024-05-03T19:50:19.303Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T19:50:21.101Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=29 memory.available="23823.2 MiB" memory.required.full="6408.9 MiB" memory.required.partial="6408.9 MiB" memory.required.kv="672.0 MiB" memory.weights.total="4773.9 MiB" memory.weights.repeating="4158.7 MiB" memory.weights.nonrepeating="615.2 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
time=2024-05-03T19:50:21.102Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=29 memory.available="23823.2 MiB" memory.required.full="6408.9 MiB" memory.required.partial="6408.9 MiB" memory.required.kv="672.0 MiB" memory.weights.total="4773.9 MiB" memory.weights.repeating="4158.7 MiB" memory.weights.nonrepeating="615.2 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
time=2024-05-03T19:50:21.102Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T19:50:21.102Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama3318894424/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --parallel 1 --port 40769"
time=2024-05-03T19:50:21.103Z level=INFO source=sched.go:340 msg="loaded runners" count=1
time=2024-05-03T19:50:21.103Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140553671749632","timestamp":1714765821}
{"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140553671749632","timestamp":1714765821}
{"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":127,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA =
 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140553671749632","timestamp":1714765821,"total_thr
eads":255}
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from /root/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-1.1-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   4:                          gemma.block_count u32              = 28
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   57 tensors
llama_model_loader: - type q4_0:  196 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 192
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.54 B
llm_load_print_meta: model size       = 4.66 GiB (4.69 BPW)
llm_load_print_meta: general.name     = gemma-1.1-7b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   615.23 MiB
llm_load_tensors:      CUDA0 buffer size =  4773.90 MiB
.
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   896.00 MiB
llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   506.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 931
llama_new_context_with_model: graph splits = 2
_ {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140553671749632","timestamp":1714765823}
{"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"140553671749632","timestamp":1714765823}
{"function":"main","level":"INFO","line":3067,"msg":"model loaded","tid":"140553671749632","timestamp":1714765823}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3270,"msg":"HTTP server listening","n_threads_http":"254","port":"40769","tid":"140553671749632","timestamp":1714765823}
{"function":"update_slots","level":"INFO","line":1581,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"140553671749632","timestamp":1714765823}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":0,"tid":"140553671749632","timestamp":1714765823}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":36898,"status":200,"tid":"140550372052992","time
stamp":1714765823}

I have cut off the end of the above log; it successfully ends with a ready-to-use prompt line.

On the other hand, the two runs below hang indefinitely:

Bad run on H100 with gemma:7b-instruct-v1.1-q4_0
root@C.10747901:~$ ollama run gemma:7b-instruct-v1.1-q4_0
[GIN] 2024/05/03 - 20:02:19 | 200 |      25.739µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/05/03 - 20:02:19 | 200 |     519.066µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/05/03 - 20:02:19 | 200 |     821.282µs |       127.0.0.1 | POST     "/api/show"
time=2024-05-03T20:02:19.037Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-05-03T20:02:19.043Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1224074478/runners/cuda_v11/libcudart.so.11.0 count=2
time=2024-05-03T20:02:19.043Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T20:02:20.616Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=29 memory.available="80482.9 MiB" memory.required.full="6408.9 MiB" memory.required.partial="6408.9 MiB" memory.required.kv="672.0 MiB" memory.weights.total="4773.9 MiB" memory.weights.repeating="4158.7 MiB" memory.weights.nonrepeating="615.2 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
time=2024-05-03T20:02:20.617Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=29 memory.available="80482.9 MiB" memory.required.full="6408.9 MiB" memory.required.partial="6408.9 MiB" memory.required.kv="672.0 MiB" memory.weights.total="4773.9 MiB" memory.weights.repeating="4158.7 MiB" memory.weights.nonrepeating="615.2 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
time=2024-05-03T20:02:20.617Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T20:02:20.617Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama1224074478/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --parallel 1 --port 46235"
time=2024-05-03T20:02:20.618Z level=INFO source=sched.go:340 msg="loaded runners" count=1
time=2024-05-03T20:02:20.618Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140132691918848","timestamp":1714766540}
{"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140132691918848","timestamp":1714766540}
{"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":112,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA =
 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140132691918848","timestamp":1714766540,"total_thr
eads":224}
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from /root/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-1.1-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   4:                          gemma.block_count u32              = 28
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   57 tensors
llama_model_loader: - type q4_0:  196 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 192
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.54 B
llm_load_print_meta: model size       = 4.66 GiB (4.69 BPW)
llm_load_print_meta: general.name     = gemma-1.1-7b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Bad run on H100 with llama2:70b-chat-q4_0
root@C.10747901:~$ ollama run llama2:70b-chat-q4_0
[GIN] 2024/05/03 - 20:11:07 | 200 |      25.188µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/05/03 - 20:11:07 | 200 |     478.553µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/05/03 - 20:11:07 | 200 |     262.277µs |       127.0.0.1 | POST     "/api/show"
time=2024-05-03T20:11:07.545Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-05-03T20:11:07.553Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1224074478/runners/cuda_v11/libcudart.so.11.0 count=2
time=2024-05-03T20:11:07.553Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T20:11:08.354Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=81 memory.available="80482.9 MiB" memory.required.full="38351.1 MiB" memory.required.partial="38351.1 MiB" memory.required.kv="640.0 MiB" memory.weights.total="36930.1 MiB" memory.weights.repeating="36725.0 MiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="348.0 MiB"
time=2024-05-03T20:11:08.356Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=81 memory.available="80482.9 MiB" memory.required.full="38351.1 MiB" memory.required.partial="38351.1 MiB" memory.required.kv="640.0 MiB" memory.weights.total="36930.1 MiB" memory.weights.repeating="36725.0 MiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="348.0 MiB"
time=2024-05-03T20:11:08.356Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-03T20:11:08.357Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama1224074478/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-68bbe6dc9cf42eb60c9a7f96137fb8d472f752de6ebf53e9942f267f1a1e2577 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --parallel 1 --port 44497"
time=2024-05-03T20:11:08.357Z level=INFO source=sched.go:340 msg="loaded runners" count=1
time=2024-05-03T20:11:08.357Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140574146117632","timestamp":1714767068}
{"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140574146117632","timestamp":1714767068}
{"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":112,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA =
 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140574146117632","timestamp":1714767068,"total_thr
eads":224}
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from /root/.ollama/models/blobs/sha256-68bbe6dc9cf42eb60c9a7f96137fb8d472f752de6ebf53e9942f267f1a1e2577 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,61249]   = ["_ t", "e r", "i n", "_ a", "e n...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  561 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 36.20 GiB (4.51 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.74 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:        CPU buffer size =   140.62 MiB
llm_load_tensors:      CUDA0 buffer size = 36930.11 MiB
.
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   640.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.15 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   324.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    20.01 MiB
llama_new_context_with_model: graph nodes  = 2566
llama_new_context_with_model: graph splits = 2
Author
Owner

@dhiltgen commented on GitHub (May 4, 2024):

@MarkoSagadin when you say "hang indefinitely" can you clarify how long we're talking? We've seen some cloud instances have quite slow I/O and can take a very long time to load models. Some users are reporting our default timeout of 10m isn't sufficient to load models on some setups. Does it eventually hit a model load timeout error at 10m?
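
One way to tell a genuinely hung load apart from a very slow one is to watch the server log and GPU memory while the request is pending. A minimal sketch for a Docker install like the one in this report (the container name `ollama` is an assumption):

```
# Follow the server log; a healthy load keeps printing
# "model load progress" lines until the runner comes online.
docker logs -f ollama

# In another terminal, watch GPU memory fill as layers are offloaded.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5
```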

Author
Owner

@MarkoSagadin commented on GitHub (May 5, 2024):

@dhiltgen it hung for at least 3 minutes; that is the timeout I had set for the httpx requests inside my program.

I can't say how much longer it would have hung (and it will be a while before I can test this, since I will be out of the office next week).

What I find odd is that running models works fine on an RTX 4090 GPU but not on an H100 GPU, and the H100 actually worked fine two days ago.
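
Since a client-side timeout can make a slow first load look like a hang, it may help to retest with a generous request timeout. A hedged sketch using curl (the 900-second value is an arbitrary choice, not an Ollama recommendation):

```
# Give the first (cold) request plenty of time before the client gives up.
curl --max-time 900 http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Say hello."
}'
```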

Author
Owner

@StrikerRUS commented on GitHub (May 14, 2024):

Sorry for the off-topic note. @rk-spirinova, you'd better hide your public, unprotected IP in your screenshot.

Author
Owner

@MarkoSagadin commented on GitHub (May 15, 2024):

@dhiltgen I have made some new discoveries on this issue.

  • The issue is present on the H100 with Docker versions 0.1.38, 0.1.37, 0.1.30 and 0.1.28 (I was curious how far back I would need to go for this to work).
  • The issue is not present on RTX 4090 with above versions.

I let the /api/generate call (using the llama3 model) hang, to see whether I would hit the 10 min timeout you mentioned. To my surprise, the model actually executed my request after 5 minutes and 30 seconds. Subsequent runs with the same model or a different one (phi3) were instantaneous.

This is a workaround I can work with, as long as the long delays don't come back when switching between models; I am evaluating many of them...
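
If the cost is only paid on the first load, one way to avoid paying it repeatedly while evaluating models is to keep them resident with the API's keep_alive parameter, assuming there is enough VRAM for the models being switched between. A rough sketch:

```
# Load the model once and ask the server to keep it in memory indefinitely.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "keep_alive": -1
}'

# Confirm it stays loaded.
ollama ps
```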

Is there any extra information I can provide, such as instance details, NVIDIA driver versions, Ollama logs, etc., to help solve this issue?

Author
Owner

@dhiltgen commented on GitHub (May 16, 2024):

@MarkoSagadin that's great to hear it did actually load and wasn't hung.

Now that we know it's not stuck and is more of a performance problem, the question is whether there's a bug in the Ollama code making this unnecessarily inefficient, or whether your cloud instance needs tuning. Check that the storage device holding the models is high-performance storage, and try different instance types (CPUs, memory); you may be on an instance type that is getting "starved" for some resource at the hypervisor level, resulting in slow I/O transfers into the GPU.

Is there any extra information that I can provide? Such as instance information, nvidia driver versions, Ollama logs, etc. to help solving this issue?

You might want to try to explore some metrics on your node to see if there's an obvious bottleneck leading to this slow performance. For example iostat -dmx 5

If none of that helps you find a good combination, can you share what cloud provider you're using, and the VM configuration details so we can try to repro?
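
As a concrete way to check whether the load is I/O bound, one option is to measure how fast the model blob itself can be read back and compare that with the observed load time. A rough sketch, assuming the default model path from the logs above (the blob name is simply the one that appears there):

```
# Watch per-device utilisation while the model loads.
iostat -dmx 5

# Measure raw read throughput of the model blob, bypassing the page cache.
dd if=/root/.ollama/models/blobs/sha256-68bbe6dc9cf42eb60c9a7f96137fb8d472f752de6ebf53e9942f267f1a1e2577 \
   of=/dev/null bs=1M iflag=direct status=progress

# Back-of-the-envelope: a ~36 GiB q4_0 70B blob read at ~110 MiB/s takes
# roughly 5.5 minutes, which matches the load times reported in this thread.
echo "scale=1; 36*1024/110/60" | bc
```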

Author
Owner

@pdevine commented on GitHub (May 18, 2024):

I just tried this with a 2xA100 on Ubuntu 22.04 and everything is working correctly:

$ ollama ps
NAME           	ID          	SIZE  	PROCESSOR	UNTIL
llama3:instruct	a6990ed6be41	5.4 GB	100% GPU 	4 minutes from now

and the timing output from --verbose:

total duration:       6.791328594s
load duration:        4.626741228s
prompt eval count:    189 token(s)
prompt eval duration: 180.487ms
prompt eval rate:     1047.17 tokens/s
eval count:           177 token(s)
eval duration:        1.839261s
eval rate:            96.23 tokens/s

This is w/ ollama 0.1.38.

Results for gemma:7b-instruct-v1.1-q4_0:

total duration:       1.004777727s
load duration:        2.342302ms
prompt eval count:    18 token(s)
prompt eval duration: 130.858ms
prompt eval rate:     137.55 tokens/s
eval count:           74 token(s)
eval duration:        739.935ms
eval rate:            100.01 tokens/s

And for llama2:70b-chat-q4_0:

total duration:       25.950922105s
load duration:        2.576583ms
prompt eval count:    29 token(s)
prompt eval duration: 587.475ms
prompt eval rate:     49.36 tokens/s
eval count:           612 token(s)
eval duration:        25.229529s
eval rate:            24.26 tokens/s
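
For reference, timing output like the above comes from running the model with the --verbose flag, e.g.:

```
# Prints total/load/prompt-eval/eval durations after each response.
ollama run llama3:instruct "Why is the sky blue?" --verbose
```
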
Author
Owner

@pdevine commented on GitHub (May 19, 2024):

@rk-spirinova can you update to 0.1.38 and try again? Also, what version of Linux are you running?

Author
Owner

@hekmon commented on GitHub (May 24, 2024):

I had the same issue:

  • working great on a 3090
  • big performance issue on an H100
    • 5 min 30 s for me too to load a model (even a small embeddings model)
    • inference was painfully slow (between 1s and 8s)

I decided to try a local build, following https://github.com/ollama/ollama/issues/4131#issuecomment-2097813973:

  • updated the CUDA toolkit (https://developer.nvidia.com/cuda-downloads) from 12.4 to 12.5 (driver from 550 to 555, see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#switching-between-driver-module-flavors)
  • checked out the v0.1.39 prerelease version (https://github.com/ollama/ollama/releases/tag/v0.1.39)
  • built Ollama with the machine's own updated CUDA (https://github.com/ollama/ollama/blob/main/docs/development.md#linux)
  • started Ollama with the new OLLAMA_FLASH_ATTENTION=1 (this may have nothing to do with the fix, but since I set it, I report it anyway)

Everything is running super smoothly now. My guess is that there is some issue with the A100/H100 cards and the CUDA version used by Ollama for the official builds (binary and Docker).
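
For anyone wanting to reproduce this kind of local build, a rough sketch of the steps referenced above (see docs/development.md in the repository for the authoritative instructions; this assumes a working Go toolchain and the locally installed CUDA toolkit):

```
git clone https://github.com/ollama/ollama.git
cd ollama
git checkout v0.1.39
# Generate the llama.cpp runners against the local CUDA toolkit, then build.
go generate ./...
go build .
# Run the freshly built binary with flash attention enabled.
OLLAMA_FLASH_ATTENTION=1 ./ollama serve
```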

Author
Owner

@pdevine commented on GitHub (May 24, 2024):

@hekmon what version of Linux are you using and are you using this in the cloud somewhere? I just want to see if I can duplicate the issue.

Author
Owner

@hekmon commented on GitHub (May 24, 2024):

This is a baremetal server, unfortunately, not a cloud instance. But I am confident that the issue lies with the A100/H100 and the CUDA version used to compile Ollama, therefore any Ubuntu 22.04 machine with an A100/H100 should be able to replicate the issue.

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

Edit: just saw your previous comment where you said you tested it on Ubuntu 22.04 with 2xA100. That's a bummer. Maybe it's on the model side then? The 2 models I was using (with the experimental env vars for max loaded models set to 4 and request concurrency set to 10) were:

  • mxbai for embeddings
  • deepseekcoder 6.7b (tried both quantized and fp16)
Author
Owner

@pdevine commented on GitHub (May 24, 2024):

Edit: just saw your previous comment where you said you tested it on Ubuntu 22.04 with 2xA100. That's a bummer. Maybe it's on the model side then? The 2 models I was using (with the experimental env vars for max loaded models set to 4 and request concurrency set to 10) were:

  • mxbai for embeddings
  • deepseekcoder 6.7b (tried both quantized and fp16)

It could still be a driver issue. I just used whatever drivers were available in the "ML-in-a-box" Ubuntu 22.04 image on Paperspace; I didn't try from scratch. I'll give that a shot soon.

Author
Owner

@pdevine commented on GitHub (May 27, 2024):

OK, this turns out to be an NVIDIA problem: they updated the driver and it no longer loads the correct kernel modules.

#4652 fixes this, but the workaround is to run:

sudo modprobe nvidia
sudo modprobe nvidia_uvm

I'll go ahead and close out the issue.
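
To verify the workaround took effect, and to keep the modules loaded across reboots, something like the following may help (the modules-load.d mechanism is standard systemd behaviour, not Ollama-specific):

```
# Check whether the NVIDIA kernel modules are actually loaded.
lsmod | grep nvidia

# Load them for the current boot if they are missing.
sudo modprobe nvidia
sudo modprobe nvidia_uvm

# Ask systemd to load them automatically on every boot.
printf 'nvidia\nnvidia_uvm\n' | sudo tee /etc/modules-load.d/nvidia.conf
```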

Author
Owner

@pdevine commented on GitHub (May 27, 2024):

Note, this is specifically w/ driver version 555. @hekmon if you do an ollama ps you can see that everything is loaded onto the CPU instead of the GPU, so it's not really the same problem that was initially reported (which I think was fixed before).

Author
Owner

@hekmon commented on GitHub (May 27, 2024):

That's odd.

I used to have driver 550 (updated my original post) and had the issue. I now have 555, and my locally compiled Ollama correctly reports GPU usage with ollama ps.

I will try to launch the original Ollama 0.1.38 Docker container, which I might still have unmodified, to double-check that it had the issue even with 555 while the locally compiled Ollama does not, and will report back ASAP.

Author
Owner

@hekmon commented on GitHub (May 28, 2024):

So I restarted my original Ollama v0.1.38 container with the new CUDA/driver version. I no longer see the 5 min 30 s load time, but it still does not work properly (symptoms are now closer to https://github.com/ollama/ollama/issues/4131):

  • Starting is ok
  • Then I try to load the mxbai embeddings model with the following curl: curl "http://127.0.0.1:11434/api/embeddings" -d '{"model": "mxbai-embed-large", "keep_alive": -1}'
  • The curl then blocks for 1 min 52 s and returns {"error":"timed out waiting for llama runner to start - progress 1.00 - "}
  • During that time, an ollama ps clearly shows that the GPU is in use (not the CPU):
NAME                            ID              SIZE    PROCESSOR       UNTIL
mxbai-embed-large:latest        468836162de7    1.3 GB  100% GPU        Forever
  • After the error the model does not appear in ollama ps any more

ollama_v0.1.38.log: https://github.com/ollama/ollama/files/15467477/ollama_v0.1.38.log

TL;DR
After the CUDA/driver upgrade, the symptoms changed (a 5 min 30 s load time with a usable model at the end -> a model load timeout after ~2 min), but Ollama still does not work as expected, while a locally compiled Ollama does. In my opinion this has nothing to do with https://github.com/ollama/ollama/pull/4652 (my NVIDIA kernel modules are loaded).

Author
Owner

@dhiltgen commented on GitHub (May 28, 2024):

I've made some recent changes to try to tighten up our timeouts on model loading but it looks like that new 1m timer might be too aggressive in some cases...

May 28 09:37:07 llmlab-01 docker[1758301]: time=2024-05-28T09:37:07.459Z level=DEBUG source=server.go:573 msg="model load progress 1.00"
May 28 09:38:07 llmlab-01 docker[1758301]: time=2024-05-28T09:38:07.494Z level=ERROR source=sched.go:344 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 1.00 - "

The underlying runner claims 100% loaded, but still doesn't come online a minute later.

Author
Owner

@alonilon commented on GitHub (May 28, 2024):

Wow, this thread is spot on!
Just had this issue today running Ollama on OpenShift with a GPU.

I’ll update when I get to test this tomorrow morning!

Author
Owner

@alonilon commented on GitHub (May 29, 2024):

Unfortunately after further testing today this does not seem to resolve the issue in my case

No further logs or information is given after the 5m timeout

Still waiting for the llama runner to start :(

Author
Owner

@dhiltgen commented on GitHub (May 29, 2024):

@alonilon can you set OLLAMA_DEBUG=1 and share the logs leading up to the timeout? I'm curious if it was stalled at 0%, 100% or some other behavior.
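
For a plain Docker install, the debug flag can be passed as an environment variable when starting the container; a minimal sketch based on the standard Ollama Docker invocation (adapt to the OpenShift deployment, e.g. via the pod's environment, as needed):

```
docker run -d --gpus=all \
  -e OLLAMA_DEBUG=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
docker logs -f ollama
```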

Author
Owner

@alonilon commented on GitHub (May 30, 2024):

@dhiltgen
Unfortunately I cannot share the logs as the machine is disconnected from the internet

I can see the logs of the model being loaded all the way up to 1.00

Are there any other debug options I could try?

Is there any way to debug the status of the runner itself?
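
One hedged way to inspect the runner directly, based on the runner command line visible in the server logs earlier in this thread, is to find the port passed via --port and poll it while the load is in progress; whether the runner answers on /health depends on the build, so treat this as an assumption:

```
# Find the spawned runner process and the port it was given.
ps aux | grep ollama_llama_server

# Poll the runner's health endpoint (replace 44497 with the port from above).
curl http://127.0.0.1:44497/health
```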

Author
Owner

@dmikushin commented on GitHub (Feb 23, 2025):

With the latest Ollama I can confirm that the upload time is extremely slow on an A100 (please see below). The optimal CUDA host-to-device bandwidth should be 25 GB/sec, according to bandwidthTest on the same machine. But maybe this is not an A100 problem. Another oddity is the disk, which in my case is fuse-overlayfs on top of an XFS filesystem. I wouldn't be surprised if this disk setup is not as well tested as native ext4.
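
A quick way to test the fuse-overlayfs theory, under the assumption that the slowness is in the filesystem path rather than the GPU, is to compare blob read throughput through the overlay with a copy on a native filesystem, or to bind-mount a host directory for the models so reads bypass the overlay entirely. A sketch (the blob name and /data/ollama host path are placeholders):

```
# Read a model blob through the current (overlay) path and note the throughput.
dd if=/root/.ollama/models/blobs/<some-blob> of=/dev/null bs=1M status=progress

# Bind-mount a host directory for models so reads bypass fuse-overlayfs.
docker run -d --gpus=all \
  -v /data/ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama
```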

llm_load_tensors: offloading 80 repeating layers to GPU                                                                                                      
llm_load_tensors: offloading output layer to GPU                                                                                                             
llm_load_tensors: offloaded 81/81 layers to GPU                                                                                                              
llm_load_tensors:    CUDA_Host model buffer size =  1064.62 MiB                                                                                              
llm_load_tensors:        CUDA0 model buffer size = 70429.66 MiB                                                                                              
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads                                                       
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0                                                                        
time=2025-02-24T00:22:14.131+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.01"                                                           
time=2025-02-24T00:22:20.923+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.03"                                                           
time=2025-02-24T00:22:21.174+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.03"                                                           
time=2025-02-24T00:22:21.425+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.03"                                                           
time=2025-02-24T00:22:21.676+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.03"                                                           
time=2025-02-24T00:22:22.682+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:23.688+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:24.694+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:25.196+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:25.448+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.04"                                                           
time=2025-02-24T00:22:26.453+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.05"                                                           
time=2025-02-24T00:22:27.459+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.05"                                                           
time=2025-02-24T00:22:28.465+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.05"                                                           
time=2025-02-24T00:22:28.967+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.06"                                                           
time=2025-02-24T00:22:29.219+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.06"
time=2025-02-24T00:22:30.226+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.06"
time=2025-02-24T00:22:31.232+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.06"
time=2025-02-24T00:22:32.489+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:32.740+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:32.991+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:33.996+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:35.252+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.07"
time=2025-02-24T00:22:36.257+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:36.508+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:36.759+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:37.010+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:38.016+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.08"
time=2025-02-24T00:22:39.021+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.09"
time=2025-02-24T00:22:40.026+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.09"
time=2025-02-24T00:22:40.277+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.09"
time=2025-02-24T00:22:40.779+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.09"
time=2025-02-24T00:22:41.784+09:00 level=DEBUG source=server.go:602 msg="model load progress 0.10"
Author
Owner

@dmikushin commented on GitHub (Feb 23, 2025):

Finally, the same 5.5 min for me:

[GIN] 2025/02/24 - 00:43:53 | 200 |         5m57s |       127.0.0.1 | POST     "/api/generate"
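
A quick back-of-the-envelope using the CUDA0 buffer size from the log above: roughly 70430 MiB loaded over ~357 s works out to about 200 MiB/s of effective throughput, far below both NVMe and PCIe capability, which again points at the storage path rather than the GPU:

```
echo "70430/357" | bc   # ≈ 197 MiB/s effective load throughput
```
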
Author
Owner

@pdevine commented on GitHub (Feb 24, 2025):

@dmikushin how is your fs set up? Just wondering if this is IO bound to the disk.

Author
Owner

@dmikushin commented on GitHub (Feb 24, 2025):

@pdevine Hard to tell more ATM. We haven't noticed disk I/O issues in other apps. I guess I need to dig into the Ollama internals. But overall this issue is valid and should not be closed.

Reference: github-starred/ollama#2546