[GH-ISSUE #1359] 4 GPUs, each with 12.2GiB. The utility loads more into rank 0, but it only gets up to about 4+ GiB, never close to 12.2GiB #62749

Closed
opened 2026-05-03 10:10:40 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @phalexo on GitHub (Dec 3, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1359

Originally assigned to: @dhiltgen on GitHub.

```
cuBLAS error 15 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7586
current device: 0
⠸ 2023/12/02 22:53:21 llama.go:436: exit status 1
2023/12/02 22:53:21 llama.go:510: llama runner stopped successfully
[GIN] 2023/12/02 - 22:53:21 | 200 | 1.311500885s | 127.0.0.1 | POST "/api/generate"
Error: llama runner exited, you may not have enough available memory to run this model
```

The model in question is orca-2-13b.Q6_K:latest, a 6-bit quantized model that I converted following the Ollama instructions.

EDIT: I have now also tried "mistral", downloaded the standard way via `ollama run mistral`. When I enter a prompt, it either produces lines of "####...." or fails altogether and dies.

The original model file is GGUF V3.

The converted size is about 10GiB.
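A rough back-of-envelope check suggests an even split of this model should fit easily on the four 12GiB cards. The sketch below uses the sizes from this report; the even-split assumption and the per-GPU overhead figure are assumptions for illustration, not ollama's actual scheduler logic:

```python
# Numbers from this report; the split logic and overhead are assumptions.
MODEL_GIB = 10.0          # converted Q6_K model size
GPUS = 4                  # the four 12 GiB cards (the fifth 4 GiB card aside)
VRAM_PER_GPU_GIB = 12.0   # 12288 MiB per card
OVERHEAD_GIB = 1.5        # assumed KV cache + CUDA context per GPU

per_gpu = MODEL_GIB / GPUS + OVERHEAD_GIB
print(f"estimated per-GPU usage: {per_gpu:.2f} GiB")
assert per_gpu < VRAM_PER_GPU_GIB  # an even split should fit comfortably
```

Interestingly, this naive estimate lands near the ~4.7GiB actually observed on rank 0, which is consistent with the load stopping well short of each card's capacity.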

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:04:00.0 Off |                  N/A |
| 22%   15C    P8    33W / 275W |   4774MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:05:00.0 Off |                  N/A |
| 22%   16C    P8    32W / 275W |   2979MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:08:00.0 Off |                  N/A |
| 22%   14C    P8    32W / 275W |   2979MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:09:00.0 Off |                  N/A |
| 22%   13C    P8    32W / 275W |   2979MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:85:00.0 Off |                  N/A |
|  0%   19C    P8     7W / 177W |      0MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
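For scripting this kind of check, `nvidia-smi` can emit the same memory columns as machine-readable CSV via `--query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`. The sketch below parses a sample string mirroring the table above; on a live system you would capture the command's stdout instead:

```python
# Sample output mirroring the table above, as produced by:
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
sample = """\
0, 4774, 12288
1, 2979, 12288
2, 2979, 12288
3, 2979, 12288
4, 0, 4096"""

usage = {}
for line in sample.splitlines():
    idx, used_mib, total_mib = (int(field) for field in line.split(","))
    usage[idx] = (used_mib, total_mib)

# GPU 0 carries 1795 MiB (~1.75 GiB) more than the other 12 GiB cards.
print(usage[0][0] - usage[1][0], "MiB extra on GPU 0")
```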
GiteaMirror added the bug label 2026-05-03 10:10:40 -05:00

@BruceMacD commented on GitHub (Dec 4, 2023):

Looks like another instance of an ongoing multi-GPU bug. Linking for future reference: https://github.com/jmorganca/ollama/issues/969

<!-- gh-comment-id:1839681899 -->

@phalexo commented on GitHub (Dec 4, 2023):

I dropped the version to 0.1.11 and it started working.


<!-- gh-comment-id:1839687225 -->

@dhiltgen commented on GitHub (Jan 27, 2024):

We've improved our memory prediction calculations over the past few weeks. Please give 0.1.22 a try and see if you're still seeing the problem.
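The fix described here amounts to predicting, before loading, whether a model plus its per-GPU overhead fits in the available VRAM. The following is a minimal sketch of that kind of fit check under assumed logic (not ollama's actual implementation); the `kv_cache_bytes` and `margin` figures are illustrative assumptions:

```python
# Hypothetical memory-fit predicate: does the model, plus one KV-cache
# allocation per GPU and a safety margin, fit in the listed GPUs' VRAM?
def fits(model_bytes: int, free_vram_bytes: list[int],
         kv_cache_bytes: int = 1 << 30, margin: float = 0.9) -> bool:
    budget = sum(int(v * margin) - kv_cache_bytes for v in free_vram_bytes)
    return model_bytes <= budget

GIB = 1 << 30
# Four 12 GiB cards vs. the ~10 GiB Q6_K model from this report:
print(fits(10 * GIB, [12 * GIB] * 4))   # -> True
# A single 4 GiB card (like GPU 4 above) clearly cannot hold it:
print(fits(10 * GIB, [4 * GIB]))        # -> False
```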

<!-- gh-comment-id:1912911524 -->

@dhiltgen commented on GitHub (Feb 1, 2024):

If you're still having problems with 0.1.22 or newer, please re-open.

<!-- gh-comment-id:1922456680 -->
Reference: github-starred/ollama#62749