No response from Ollama (OpenShift / Kubernetes) #411

Closed
opened 2025-11-11 14:20:37 -06:00 by GiteaMirror · 0 comments

Originally created by @acocalypso on GitHub (Mar 4, 2024).

I set this up on an OpenShift cluster. Ollama and the WebUI are running in CPU-only mode, and I can pull models, add prompts, etc.
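
For context, here is roughly how I verified that the pods are running and that the models are actually present; a sketch assuming the deployment is named `ollama` and `curl` is available in the image (both specific to my setup):

```bash
# Sanity check: both deployments should report Running
oc get pods

# /api/tags is the standard Ollama endpoint that lists pulled models
oc exec deploy/ollama -- curl -s http://localhost:11434/api/tags
```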

When I open a chat, select a model, and ask a question, it runs forever and I never get a response.

On the Ollama server I see:

```
time=2024-03-04T09:26:35.289Z level=INFO source=images.go:710 msg="total blobs: 16"
time=2024-03-04T09:26:35.306Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-04T09:26:35.312Z level=INFO source=routes.go:1019 msg="Listening on [::]:11434 (version 0.1.27)"
time=2024-03-04T09:26:35.313Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-04T09:26:38.152Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v5 cpu cpu_avx2 cuda_v11 rocm_v6 cpu_avx]"
time=2024-03-04T09:26:38.153Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-04T09:26:38.153Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-04T09:26:38.153Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-03-04T09:26:38.153Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-03-04T09:26:38.153Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-03-04T09:26:38.153Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-04T09:26:38.153Z level=INFO source=routes.go:1042 msg="no GPU detected"
time=2024-03-04T09:26:52.533Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-04T09:26:52.533Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-04T09:26:52.533Z level=INFO source=llm.go:77 msg="GPU not available, falling back to CPU"
loading library /tmp/ollama2542450014/cpu_avx2/libext_server.so
time=2024-03-04T09:26:52.534Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama2542450014/cpu_avx2/libext_server.so"
time=2024-03-04T09:26:52.534Z level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256:3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 259/32016 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = codellama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  3647.95 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    13.02 MiB
llama_new_context_with_model:        CPU compute buffer size =   160.00 MiB
llama_new_context_with_model: graph splits (measure): 1
time=2024-03-04T09:28:02.822Z level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
```
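
The model loads and the main loop starts, so the server itself looks healthy up to this point. To rule out the WebUI, one option is to send a prompt straight to the Ollama API; a minimal sketch, again assuming the `ollama` deployment name and `curl` in the image:

```bash
# Bypass the WebUI entirely and ask Ollama for a completion.
# "stream": false returns one JSON object instead of a token stream.
oc exec deploy/ollama -- curl -s http://localhost:11434/api/generate \
  -d '{"model": "codellama", "prompt": "Say hello", "stream": false}'
```

If this returns text, inference works and the problem sits between the WebUI and Ollama; if it also hangs, Ollama itself is stuck.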

On the WebUI I see:

INFO: 10.156.80.10:0 - "GET /litellm/api/v1/models HTTP/1.1" 200 OK
INFO: 10.156.80.10:0 - "POST /api/v1/chats/new HTTP/1.1" 200 OK
INFO: 10.156.80.10:0 - "GET /api/v1/chats/ HTTP/1.1" 200 OK

My Ollama container isn't using any CPU, so I assume it's not doing anything at all.
![image](https://github.com/open-webui/open-webui/assets/2846629/c01d2eed-cf39-4907-9cb6-8be36991e72d)
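
A quick way to confirm the idle-CPU observation from the CLI, assuming the cluster's metrics server is installed:

```bash
# Live CPU/memory per pod; an Ollama pod generating tokens on CPU
# should sit near its CPU limit, not at zero.
oc adm top pods
```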

Is there any way to troubleshoot this?
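
One sketch of a final check: follow the Ollama log while sending a chat from the WebUI; if nothing new appears, the request never arrives at Ollama.

```bash
# Tail the Ollama log, then trigger a chat from the WebUI in another terminal
oc logs -f deploy/ollama
```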

Reference: github-starred/open-webui#411