[GH-ISSUE #1368] Different behavior between running on the host versus running on GPUs. #26480

Closed
opened 2026-04-22 02:46:36 -05:00 by GiteaMirror · 4 comments

Originally created by @phalexo on GitHub (Dec 4, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1368

Originally assigned to: @dhiltgen on GitHub.

When running on the GPUs (one or more), the output is either a single character or one seemingly unrelated word, followed by lines of '#'.
It does this for a while. Sometimes it hits an exception: cuBLAS error 15 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7586

But the same model(s), run on the host, produce normal output.
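
A quick way to confirm the host-vs-GPU split is to send the same prompt through the /api/generate endpoint twice, once with GPU offload disabled via the num_gpu option. This is only a minimal sketch, not part of the original report: it assumes an Ollama server on the default 127.0.0.1:11434, and "llama2" is a placeholder for whichever model misbehaves.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// generate sends one non-streaming request to /api/generate.
// numGPU = 0 forces CPU-only inference; a negative value leaves
// the layer-offload setting at the server default.
func generate(numGPU int) (string, error) {
	opts := map[string]any{}
	if numGPU >= 0 {
		opts["num_gpu"] = numGPU
	}
	payload, err := json.Marshal(map[string]any{
		"model":   "llama2", // placeholder: use the model that misbehaves
		"prompt":  "Hello World.",
		"stream":  false,
		"options": opts,
	})
	if err != nil {
		return "", err
	}
	resp, err := http.Post("http://127.0.0.1:11434/api/generate",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	var out struct {
		Response string `json:"response"`
	}
	if err := json.Unmarshal(body, &out); err != nil {
		return "", err
	}
	return out.Response, nil
}

func main() {
	cpu, err := generate(0) // CPU only
	if err != nil {
		panic(err)
	}
	gpu, err := generate(-1) // default GPU offload
	if err != nil {
		panic(err)
	}
	fmt.Printf("CPU: %q\nGPU: %q\n", cpu, gpu)
}
```

If the CPU-only run is coherent while the default run produces the '#' stream, that points at the CUDA offload path rather than the model file.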


@phalexo commented on GitHub (Dec 4, 2023):

I have rebuilt ollama from the cloned source, and I still have the same issue: junk output when running on GPUs, and an error/exception when I input a second query.

a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
anca#####################################################################################################################################################################g
{"timestamp":1701707692,"level":"INFO","function":"log_server_request","line":2478,"message":"request","remote_addr":"127.0.0.1","remote_port":48378,"status":200,"method":"POST","path":"/completion","params":{}}
{"timestamp":1701707692,"level":"INFO","function":"log_server_request","line":2478,"message":"request","remote_addr":"127.0.0.1","remote_port":42320,"status":200,"method":"POST","path":"/tokenize","params":{}}
[GIN] 2023/12/04 - 11:34:52 | 200 | 21.627947963s | 127.0.0.1 | POST "/api/generate"

>>>
>>> <s> Hello World. </s>
{"timestamp":1701707715,"level":"INFO","function":"log_server_request","line":2478,"message":"request","remote_addr":"127.0.0.1","remote_port":34652,"status":200,"method":"HEAD","path":"/","params":{}}
{"timestamp":1701707715,"level":"INFO","function":"log_server_request","line":2478,"message":"request","remote_addr":"127.0.0.1","remote_port":34652,"status":200,"method":"POST","path":"/detokenize","params":{}}

cuBLAS error 15 at /home/developer/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7586
current device: 0
⠸ 2023/12/04 11:35:16 llama.go:436: exit status 1
2023/12/04 11:35:16 llama.go:510: llama runner stopped successfully
[GIN] 2023/12/04 - 11:35:16 | 200 | 494.763097ms | 127.0.0.1 | POST "/api/generate"
Error: llama runner exited, you may not have enough available memory to run this model
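
As an aside on the log above: cuBLAS status 15 corresponds to CUBLAS_STATUS_NOT_SUPPORTED in cublas_api.h, so the trailing "not enough available memory" line is ollama's generic runner-exit hint rather than a confirmed out-of-memory condition. A small decoder for reference, using the standard cublasStatus_t values:

```go
package main

import "fmt"

// cuBLAS status values as defined for cublasStatus_t in cublas_api.h.
var cublasStatus = map[int]string{
	0:  "CUBLAS_STATUS_SUCCESS",
	1:  "CUBLAS_STATUS_NOT_INITIALIZED",
	3:  "CUBLAS_STATUS_ALLOC_FAILED",
	7:  "CUBLAS_STATUS_INVALID_VALUE",
	8:  "CUBLAS_STATUS_ARCH_MISMATCH",
	11: "CUBLAS_STATUS_MAPPING_ERROR",
	13: "CUBLAS_STATUS_EXECUTION_FAILED",
	14: "CUBLAS_STATUS_INTERNAL_ERROR",
	15: "CUBLAS_STATUS_NOT_SUPPORTED",
	16: "CUBLAS_STATUS_LICENSE_ERROR",
}

func main() {
	// The code reported in the log above.
	fmt.Println(15, "=>", cublasStatus[15])
}
```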


@BruceMacD commented on GitHub (Dec 4, 2023):

This is probably related to an ongoing multi-GPU bug. Linking for future reference: #969
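
If the multi-GPU split is the suspect, one quick isolation test is to restart the server with only a single device visible via the standard CUDA_VISIBLE_DEVICES variable. A sketch that wraps `ollama serve` (assumes the ollama binary is on PATH):

```go
package main

import (
	"os"
	"os/exec"
)

// Start the server with only the first GPU visible. If the junk
// output disappears in this configuration, the multi-GPU split
// path is the likely culprit.
func main() {
	cmd := exec.Command("ollama", "serve")
	cmd.Env = append(os.Environ(), "CUDA_VISIBLE_DEVICES=0")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```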


@dhiltgen commented on GitHub (Jan 27, 2024):

Quite a few GPU-related fixes have gone in over the past few weeks. Please give this another try with the latest release, 0.1.22, and let us know if you're still hitting the problem.
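
To double-check which build is actually serving requests after an upgrade, the server reports its version over HTTP. A minimal check, assuming the default port and the /api/version endpoint present in recent releases:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// GET /api/version returns the running server's build,
	// e.g. {"version":"0.1.22"}.
	resp, err := http.Get("http://127.0.0.1:11434/api/version")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var v struct {
		Version string `json:"version"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		panic(err)
	}
	fmt.Println("ollama server version:", v.Version)
}
```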


@dhiltgen commented on GitHub (Feb 1, 2024):

If you're still having problems with 0.1.22 or newer, please re-open.
