[GH-ISSUE #483] No response from model with giant request #222

Closed
opened 2026-04-12 09:44:39 -05:00 by GiteaMirror · 2 comments

Originally created by @FairyTail2000 on GitHub (Sep 7, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/483

Using my own personal frontend with the model codellama:34b-code-q4_0, I send a giant block of code (~10 kB). The model then runs for 5–6 minutes, but only a single token comes out.
This is the HTTP response:

{"model":"codellama:34b-code-q4_0","created_at":"2023-09-07T07:34:32.574995065Z","response":"\n","done":false}
{"model":"codellama:34b-code-q4_0","created_at":"2023-09-07T07:34:33.221286574Z","done":true,"context":[truncated],"total_duration":329330974773,"load_duration":688284882,"prompt_eval_count":1207,"prompt_eval_duration":327988245000,"eval_count":1,"eval_duration":641399000}

I cannot share the code I used since it's proprietary, but I think any big blob of code should reproduce it.
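A minimal way to reproduce this against the API directly (a sketch; `big_blob.txt` stands in for any large source file, and the default port 11434 is assumed):

```sh
# Wrap a large (~10 kB) block of code into a generate request.
# jq -Rs reads the whole file as a single JSON string.
jq -Rs '{model: "codellama:34b-code-q4_0", prompt: .}' big_blob.txt \
  | curl -s http://localhost:11434/api/generate -d @-
```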

Here is the log output generated by ollama as well:

```
[GIN] 2023/09/07 - 09:28:50 | 200 | 1.680576ms | 127.0.0.1 | GET "/api/tags"
2023/09/07 09:29:03 ggml_llama.go:311: starting llama.cpp server
2023/09/07 09:29:03 ggml_llama.go:333: waiting for llama.cpp server to start responding
{"timestamp":1694071743,"level":"WARNING","function":"server_params_parse","line":845,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":0}
{"timestamp":1694071743,"level":"INFO","function":"main","line":1190,"message":"build info","build":1009,"commit":"9e232f0"}
{"timestamp":1694071743,"level":"INFO","function":"main","line":1192,"message":"system info","n_threads":8,"total_threads":16,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | "}
llama server listening at http://127.0.0.1:61088
{"timestamp":1694071744,"level":"INFO","function":"main","line":1443,"message":"HTTP server listening","hostname":"127.0.0.1","port":61088}
{"timestamp":1694071744,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":41400,"status":200,"method":"HEAD","path":"/","params":{}}
2023/09/07 09:29:04 ggml_llama.go:342: llama.cpp server started in 0.601793 seconds
{"timestamp":1694071744,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":41400,"status":200,"method":"POST","path":"/tokenize","params":{}}
{"timestamp":1694071744,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":41400,"status":200,"method":"POST","path":"/tokenize","params":{}}
{"timestamp":1694072073,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":41400,"status":200,"method":"POST","path":"/completion","params":{}}
{"timestamp":1694072073,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":41400,"status":200,"method":"POST","path":"/tokenize","params":{}}
[GIN] 2023/09/07 - 09:34:33 | 200 | 5m29s | 127.0.0.1 | POST "/api/generate"
```

```
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 48
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: freq_base = 1000000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 34B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 18168.87 MB (+ 384.00 MB per state)
llama_new_context_with_model: kv self size = 384.00 MB
llama_new_context_with_model: compute buffer total size = 305.35 MB

llama_print_timings: load time = 134902.74 ms
llama_print_timings: sample time = 1.16 ms / 2 runs ( 0.58 ms per token, 1730.10 tokens per second)
llama_print_timings: prompt eval time = 327988.24 ms / 1207 tokens ( 271.74 ms per token, 3.68 tokens per second)
llama_print_timings: eval time = 641.40 ms / 1 runs ( 641.40 ms per token, 1.56 tokens per second)
llama_print_timings: total time = 328637.49 ms
```

If anything else is needed to debug the issue, I would be happy to provide it.

An idea I already have: the input is magnitudes larger than the context, so it errors out somewhere silently.
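As a rough sanity check of that theory (a sketch; the ~4 bytes/token figure is only a common heuristic for code, not a real tokenizer, and `big_blob.txt` is the hypothetical input file from above):

```sh
# Estimate how many tokens the input is; if the estimate exceeds
# n_ctx (2048 in the log above), silent truncation is plausible.
wc -c < big_blob.txt | awk '{printf "~%d tokens estimated\n", $1 / 4}'
```

For a ~10 kB input this estimates roughly 2500 tokens, already above the 2048-token context.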

GiteaMirror added the bug label 2026-04-12 09:44:39 -05:00

@mxyng commented on GitHub (Sep 7, 2023):

Context size is set to 2048 tokens in your example, so passing in more than 2048 tokens of input has no benefit: the prompt will be truncated, in this case to approximately half of the input. You can increase the context size, but you will quickly notice performance issues.

Larger contexts require more time to evaluate. Judging from your timing outputs, it looks like you're using CPU and evaluating the prompt at roughly 3.5 tokens/s; at that rate, the 1207-token prompt alone took ~328 s, essentially all of the 5m29s request. Increasing the context size will likely make it even less usable.

Increasing the context window also increases memory usage: the KV cache scales linearly with n_ctx, so the 384.00 MB reported in your log at n_ctx = 2048 would roughly double at 4096.

You can set the context size with `PARAMETER num_ctx <size>` in the Modelfile or `{"options": {"num_ctx": <size>}}` in the generate request.

```
llama_model_load_internal: n_ctx = 2048
```
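For example, both variants might look like this (a sketch; the derived model name `codellama-4k` and the value 4096 are illustrative, not from the thread):

```sh
# Option 1: bake a larger context into a derived model via a Modelfile.
cat > Modelfile <<'EOF'
FROM codellama:34b-code-q4_0
PARAMETER num_ctx 4096
EOF
ollama create codellama-4k -f Modelfile

# Option 2: set it per request in the generate call.
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:34b-code-q4_0",
  "prompt": "...",
  "options": {"num_ctx": 4096}
}'
```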

@technovangelist commented on GitHub (Dec 4, 2023):

It looks like Mike addressed the issue, so I will go ahead and close it now. If you think there is anything we left out, reopen it and we can address it. Thanks for being part of this great community.
