[GH-ISSUE #2839] keeps loading but never success #1725

Open
opened 2026-04-12 11:42:14 -05:00 by GiteaMirror · 3 comments

Originally created by @xudong2019 on GitHub (Feb 29, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2839

ollama run renxin_query_type_classify "hello"
[screenshot: https://github.com/ollama/ollama/assets/16278392/a7ee2375-3493-4d41-b5a6-aa4481e53baf]

I successfully generated a model from a GGUF file; however, it keeps loading and never succeeds... Any idea what's happening?

FROM ./model_query_type_classify.gguf
PARAMETER temperature 0
SYSTEM """
classify user type
"""

GiteaMirror added the bug label 2026-04-12 11:42:14 -05:00

@jmorganca commented on GitHub (Feb 29, 2024):

Hi there, sorry this happened. Would it be possible to check the logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) for an error?
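
For a Docker install like the one described below, the logs can be inspected with standard Docker commands (a sketch; the container ID is the one mentioned later in this thread, and OLLAMA_DEBUG is assumed to be supported by the installed version):

# Follow the container's logs live
docker logs -f 1001d164775c
# Re-create the container with verbose debug logging enabled
docker run -d -e OLLAMA_DEBUG=1 -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama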


@xudong2019 commented on GitHub (Mar 1, 2024):

I couldn't identify any error recorded in the log file. I'm running Ollama in Docker and used "docker logs 1001d164775c" to get the logs.

Here are some of the contents of the log file, but no error message was identified...
llama_new_context_with_model: freq_scale = 1
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_kv_cache_init: CUDA0 KV buffer size = 1600.00 MiB
llama_new_context_with_model: KV self size = 1600.00 MiB, K (f16): 800.00 MiB, V (f16): 800.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 15.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 307.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 10.00 MiB
llama_new_context_with_model: graph splits (measure): 3
time=2024-03-01T01:34:05.286Z level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
time=2024-03-01T01:34:05.289Z level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
[GIN] 2024/03/01 - 01:34:07 | 200 | 7.799372595s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:34:09 | 200 | 1.921034882s | 172.19.0.3 | POST "/api/generate"
time=2024-03-01T01:35:11.330Z level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
[GIN] 2024/03/01 - 01:35:14 | 200 | 2.714265698s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:35:16 | 200 | 2.700883398s | 172.19.0.3 | POST "/api/generate"
time=2024-03-01T01:35:37.347Z level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
[GIN] 2024/03/01 - 01:35:39 | 200 | 1.983048092s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:35:41 | 200 | 2.288229176s | 172.19.0.3 | POST "/api/generate"
time=2024-03-01T01:35:52.671Z level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
[GIN] 2024/03/01 - 01:35:54 | 200 | 2.117185287s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:35:57 | 200 | 2.741038553s | 172.19.0.3 | POST "/api/generate"
time=2024-03-01T01:37:24.107Z level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
[GIN] 2024/03/01 - 01:37:26 | 200 | 2.556465865s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:37:29 | 200 | 2.56224256s | 172.19.0.3 | POST "/api/generate"
time=2024-03-01T01:39:57.186Z level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
[GIN] 2024/03/01 - 01:39:58 | 200 | 1.7195883s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:40:01 | 200 | 1.606466082s | 172.19.0.3 | POST "/api/generate"
time=2024-03-01T01:41:39.096Z level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
[GIN] 2024/03/01 - 01:41:40 | 200 | 1.688323303s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:41:42 | 200 | 1.699074626s | 172.19.0.3 | POST "/api/generate"
time=2024-03-01T01:44:01.447Z level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
[GIN] 2024/03/01 - 01:44:03 | 200 | 2.011641752s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:44:05 | 200 | 1.939643832s | 172.19.0.3 | POST "/api/generate"
time=2024-03-01T01:45:39.781Z level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
[GIN] 2024/03/01 - 01:45:41 | 200 | 1.913668125s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:45:44 | 200 | 2.054726132s | 172.19.0.3 | POST "/api/generate"

and earlier:

llama_new_context_with_model: CUDA_Host input buffer size = 15.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 307.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 10.00 MiB
llama_new_context_with_model: graph splits (measure): 3
time=2024-03-01T01:28:39.193Z level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
[GIN] 2024/03/01 - 01:28:41 | 200 | 7.861097766s | 172.19.0.3 | POST "/api/generate"
[GIN] 2024/03/01 - 01:29:51 | 200 | 2.259164053s | 172.19.0.3 | POST "/api/chat"
[GIN] 2024/03/01 - 01:29:52 | 200 | 1.839495069s | 172.19.0.3 | POST "/api/generate"
time=2024-03-01T01:30:25.356Z level=INFO source=routes.go:78 msg="changing loaded model"
time=2024-03-01T01:30:28.171Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-01T01:30:28.171Z level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-01T01:30:28.171Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-01T01:30:28.171Z level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-01T01:30:28.171Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama3442133364/cuda_v11/libext_server.so
time=2024-03-01T01:30:28.171Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama3442133364/cuda_v11/libext_server.so"
time=2024-03-01T01:30:28.171Z level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from /root/.ollama/models/blobs/sha256:456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma
llama_model_loader: - kv 1: general.name str = gemma-7b-it
llama_model_loader: - kv 2: gemma.context_length u32 = 8192
llama_model_loader: - kv 3: gemma.embedding_length u32 = 3072
llama_model_loader: - kv 4: gemma.block_count u32 = 28
llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 24576
llama_model_loader: - kv 6: gemma.attention.head_count u32 = 16
llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256
llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 2
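
Since the log shows /api/chat and /api/generate returning 200 within a couple of seconds, one way to rule out a client-side hang would be to query the server directly over HTTP (a sketch; assumes the default port 11434 and the model name from the issue):

# Bypass the CLI and call the REST API directly
curl http://localhost:11434/api/generate -d '{
  "model": "renxin_query_type_classify",
  "prompt": "hello",
  "stream": false
}'

If this returns a response while "ollama run" keeps spinning, the problem is likely in the client rather than the model.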


@xudong2019 commented on GitHub (Mar 1, 2024):

By the way, this is my Modelfile:

FROM ./model_query_type_classify.gguf
PARAMETER temperature 0
SYSTEM """
classify user input
"""
