[GH-ISSUE #10957] Ollama not returning any response on HPC cluster #7218

Closed
opened 2026-04-12 19:12:58 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @farzadsbmila on GitHub (Jun 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10957

What is the issue?

I'm running Ollama (0.9.0) for the first time on a new system (an HPC cluster), installed with `conda install -c conda-forge ollama`. I have checked that port 11434 is usable (`nc -zv 127.0.0.1 11434` succeeds after `ollama serve`).
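For anyone reproducing the port check without `nc`, here is a minimal Python sketch of the same test. The helper name `port_is_open` is made up for illustration; the host and port are the ones from the report.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; closing it right away
        # is enough for a reachability check.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Port 11434 is the default shown in the report; adjust if OLLAMA_HOST differs.
    print(port_is_open("127.0.0.1", 11434))
```

Like `nc -zv`, this only shows that something is listening on the port. It does not show that the API answers requests.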

`ollama serve` and `ollama run llama3.2:1b` (and other models) start without errors, but no response is ever returned. After I interrupt a request with Ctrl-C, a log entry appears on the server with a 200 status code for the `/api/chat` endpoint, but the client never receives a response. I tried both `curl` and the prompt from `ollama run`.

Using Python 3.12.9, CUDA 12.6.0, Ubuntu 22.04.3 LTS, NVIDIA L40S (also tried with an RTX 8000).
Also tried with Python 3.10 and CUDA 12.5.

Relevant log output


OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.9.0

GiteaMirror added the bug label 2026-04-12 19:12:58 -05:00

@rick-github commented on GitHub (Jun 3, 2025):

If you could show your actual commands and their output, that would be more informative than a summary. [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may also be helpful.

@farzadsbmila commented on GitHub (Jun 3, 2025):

Thanks @rick-github.
I can't access the `journalctl` output because of permissions (I'm on an HPC cluster).
Below is the output from `ollama serve`. I ran `ollama run llama3.2:1b` afterwards. That command prints nothing except a prompt, which never returns once input is provided.
In the API logs at the end, the first entry appears when I run `ollama run ...`. The second appears when I supply a prompt in `ollama run ...` and press Ctrl-C after some time (it doesn't return within 20 minutes).

time=2025-06-03T15:43:07.530-04:00 level=INFO source=routes.go:1234 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL:0 HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/.../.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:0 http_proxy: https_proxy: no_proxy:]"
time=2025-06-03T15:43:07.535-04:00 level=INFO source=images.go:479 msg="total blobs: 18"
time=2025-06-03T15:43:07.537-04:00 level=INFO source=images.go:486 msg="total unused blobs removed: 0"
time=2025-06-03T15:43:07.540-04:00 level=INFO source=routes.go:1287 msg="Listening on 127.0.0.1:11434 (version 0.9.0)"
time=2025-06-03T15:43:07.540-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-06-03T15:43:07.763-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-c662cfc4-ffb7-8e7d-7458-f85b8ac04ddd library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
[GIN] 2025/06/03 - 15:44:29 | 200 |     125.132µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/03 - 15:44:29 | 200 |   44.402756ms |       127.0.0.1 | POST     "/api/show"
time=2025-06-03T15:44:30.008-04:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/home/.../.ollama/models/blobs/sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45 gpu=GPU-c662cfc4-ffb7-8e7d-7458-f85b8ac04ddd parallel=2 available=33747894272 required="2.5 GiB"
time=2025-06-03T15:44:30.162-04:00 level=INFO source=server.go:135 msg="system memory" total="376.8 GiB" free="345.5 GiB" free_swap="0 B"
time=2025-06-03T15:44:30.162-04:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=17 layers.offload=17 layers.split="" memory.available="[31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.5 GiB" memory.required.partial="2.5 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[2.5 GiB]" memory.weights.total="1.2 GiB" memory.weights.repeating="986.2 MiB" memory.weights.nonrepeating="266.2 MiB" memory.graph.full="544.0 MiB" memory.graph.partial="554.3 MiB"
llama_model_loader: loaded meta data with 30 key-value pairs and 147 tensors from /home/.../.ollama/models/blobs/sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 16
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  18:                          general.file_type u32              = 7
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q8_0:  113 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 1.22 GiB (8.50 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 1.24 B
print_info: general.name     = Llama 3.2 1B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-06-03T15:44:30.494-04:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/home/.../.conda/envs/ollama/bin/ollama runner --model /home/.../.ollama/models/blobs/sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45 --ctx-size 8192 --batch-size 512 --n-gpu-layers 17 --threads 40 --parallel 2 --port 43539"
time=2025-06-03T15:44:30.495-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-03T15:44:30.495-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-03T15:44:30.496-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-03T15:44:30.518-04:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /home/.../.conda/envs/ollama/lib/ollama/libggml-cpu-skylakex.so
time=2025-06-03T15:44:30.535-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.SSE3=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-06-03T15:44:30.535-04:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:43539"
llama_model_loader: loaded meta data with 30 key-value pairs and 147 tensors from /home/.../.ollama/models/blobs/sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 16
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  18:                          general.file_type u32              = 7
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q8_0:  113 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 1.22 GiB (8.50 BPW) 
time=2025-06-03T15:44:30.748-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2048
print_info: n_layer          = 16
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 1.24 B
print_info: general.name     = Llama 3.2 1B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  1252.41 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.99 MiB
llama_kv_cache_unified: kv_size = 8192, type_k = 'f16', type_v = 'f16', n_layer = 16, can_shift = 1, padding = 32
llama_kv_cache_unified:        CPU KV buffer size =   256.00 MiB
llama_kv_cache_unified: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_context:        CPU compute buffer size =   544.01 MiB
llama_context: graph nodes  = 550
llama_context: graph splits = 1
time=2025-06-03T15:44:31.251-04:00 level=INFO source=server.go:630 msg="llama runner started in 0.76 seconds"
[GIN] 2025/06/03 - 15:44:31 | 200 |  1.482576636s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/06/03 - 15:45:20 | 200 | 13.390631803s |       127.0.0.1 | POST     "/api/chat"
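The `[GIN]` access-log entries are easy to lose among the loader output. The sketch below pulls them out so the status, latency and endpoint of each request can be read at a glance. It assumes the line layout seen in the log above; the names `GIN_LINE`, `Access` and `parse_gin` are hypothetical and not part of Ollama.

```python
import re
from typing import NamedTuple, Optional

# Layout assumed from the log above:
# [GIN] <date> - <time> | <status> | <latency> | <client> | <METHOD> "<path>"
GIN_LINE = re.compile(
    r'^\[GIN\] (?P<date>\S+) - (?P<time>\S+) \| (?P<status>\d{3}) \|'
    r'\s*(?P<latency>\S+) \|\s*(?P<client>\S+) \|'
    r'\s*(?P<method>\S+)\s+"(?P<path>[^"]*)"'
)


class Access(NamedTuple):
    time: str
    status: int
    latency: str
    method: str
    path: str


def parse_gin(line: str) -> Optional[Access]:
    """Parse one access-log line; return None for any other kind of log line."""
    m = GIN_LINE.match(line.strip())
    if m is None:
        return None
    return Access(m["time"], int(m["status"]), m["latency"], m["method"], m["path"])


if __name__ == "__main__":
    sample = [
        '[GIN] 2025/06/03 - 15:44:31 | 200 |  1.482576636s |       127.0.0.1 | POST     "/api/generate"',
        '[GIN] 2025/06/03 - 15:45:20 | 200 | 13.390631803s |       127.0.0.1 | POST     "/api/chat"',
    ]
    for entry in filter(None, map(parse_gin, sample)):
        print(entry.time, entry.status, entry.latency, entry.method, entry.path)
```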


@rick-github commented on GitHub (Jun 3, 2025):

The log shows that the model was loaded into the GPU; the `/api/generate` request is the model load. Not quite a minute later the model was asked to run an inference (`/api/chat`), which took 13 seconds. So from the server's point of view, everything looks fine. What's not explained is why you don't see the output in the ollama CLI. What happens if you run the following:

curl localhost:11434/api/chat -d '{"model":"llama3.2:1b","messages":[{"role":"user","content":"hello"}]}'
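If `curl` is awkward to use on the cluster, the same request can be sent from Python's standard library. This is only a sketch, and it rests on one assumption: that `/api/chat` streams newline-delimited JSON objects carrying `message.content` and a final `done` flag. The function names and the 60-second timeout are illustrative choices.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_chat_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text of each streamed chunk until one is marked done."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if "error" in chunk:
            raise RuntimeError(chunk["error"])
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break


def chat(prompt: str, model: str = "llama3.2:1b",
         host: str = "http://127.0.0.1:11434", timeout: float = 60.0) -> str:
    """Send one user message and return the concatenated reply."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    req = urllib.request.Request(
        host + "/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )
    # The timeout applies to each socket operation, so a stalled stream
    # raises an exception instead of hanging indefinitely.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return "".join(iter_chat_content(resp))


if __name__ == "__main__":
    print(chat("hello"))
```

The explicit timeout makes a hang visible as an exception, which helps tell "no bytes ever arrive" apart from "the reply is just slow".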
Reference: github-starred/ollama#7218