[GH-ISSUE #1696] Setting 'num_gpu 0' shouldn't preclude the use of cuBLAS for prompt evaluation #62994

Closed
opened 2026-05-03 11:10:56 -05:00 by GiteaMirror · 2 comments

Originally created by @jukofyork on GitHub (Dec 24, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1696

Hi,

I've just moved from llama.cpp to ollama and my use case is to feed large prompts to high-parameter/high-quantization models for code evaluation, but I've found there to be quite a serious problem with ollama compared to llama.cpp:

With llama.cpp I am able to run 70B models at 'q6_K' or 'q5_K_M' quantization on my system with 64 GB RAM and 24 GB VRAM by compiling with '-DLLAMA_BLAS=ON' and then running with '-ngl 0'. This allows me to use cuBLAS for the prompt evaluation on the GPU while the rest of the evaluation runs on the CPU:

A 70B model uses at most 50-60 GB of system RAM (depending on the quantization level), and you can quite clearly see it offload the de-quantization and other work to the GPU during prompt evaluation.
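For reference, a minimal build-and-run sketch of the setup described above (model and prompt paths are placeholders). Note that in llama.cpp trees from late 2023 the CUDA/cuBLAS backend was enabled with '-DLLAMA_CUBLAS=ON' ('-DLLAMA_BLAS=ON' selects a CPU BLAS backend; newer trees use '-DGGML_CUDA=ON'):

```bash
# Build llama.cpp with the CUDA (cuBLAS) backend (flag name as of late 2023;
# newer trees use -DGGML_CUDA=ON instead).
cmake -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release

# Run with zero layers offloaded (-ngl 0): the weights stay in system RAM,
# but prompt-evaluation batches are still dispatched to the GPU via cuBLAS.
./build/bin/main -m ./models/70b-q5_K_M.gguf -ngl 0 -f prompt.txt
```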

With ollama:

  • If I set 'num_gpu' to 0 then nothing gets offloaded to the GPU at all and the prompt evaluation is done at an unbearably slow speed on the CPU...
  • If I set 'num_gpu' to 1 or more then the work is offloaded to the GPU for prompt evaluation, but, because of the way the wrapped llama.cpp server works, it ends up with an extra unnecessary copy of the model in system RAM too! (See the request sketch below for where 'num_gpu' is set.)
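
For context, a minimal request sketch showing where 'num_gpu' is passed (model tag and prompt are placeholders); 'num_gpu' corresponds to llama.cpp's 'n_gpu_layers' / '-ngl':

```bash
# Hypothetical request; model tag and prompt are placeholders.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:70b",
  "prompt": "Review the following code ...",
  "options": { "num_gpu": 0 }
}'
```

The same option can also be set persistently in a Modelfile with `PARAMETER num_gpu 0`.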

I'm also getting lots of "out of memory" crashes for models that come close to the 24 GB VRAM limit but otherwise work fine with llama.cpp; from reading the discussion here, this might just be related to the v0.1.14 changes (it doesn't bother me anyway, as I'm only interested in speeding up prompt evaluation).

Well, after 2 days of pulling my hair out trying to work out why none of my changes to the source seemed to make any difference... I finally found out I had a copy of '/usr/share/bin/ollama serve' running from the stock installer all along 🤦

The problem lies with this code in 'llama.go':

if runner.Accelerated && numGPU == 0 {
    log.Printf("skipping accelerated runner because num_gpu=0")
    continue
}

If I comment that out and recompile then everything works as expected!
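For anyone wanting to try the same experiment, a rebuild sketch assuming the Go-based source build the project used at the time (the `go generate` step rebuilds the bundled llama.cpp runners):

```bash
# Rebuild ollama from source after editing llama.go
# (assumes the development build steps documented at the time).
go generate ./...
go build .
./ollama serve
```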

I didn't want to do a pull request as I've no idea of the logic behind this test or how removing it would affect others who are using CPU-only inference.


@jukofyork commented on GitHub (Dec 24, 2023):

Actually, I've now pulled the latest version and recompiled, and this problem seems to have been fixed already! I was pulling v0.1.17, which I assume is the same as the stable version downloadable from https://ollama.ai/download. I'll close this; sorry for not trying the latest release first!

It's probably a good idea to get the https://ollama.ai/download version updated to reflect this change, though, as anybody else with a use case similar to mine will be very put off when trying to set num_gpu to 0.


@mherrmann3 commented on GitHub (Feb 26, 2024):

Actually, it still behaves this way: setting num_gpu 0 precludes the use of cuBLAS (see the corresponding condition in llm.go: https://github.com/ollama/ollama/blob/a189810df6c4b0492463d1ddb68993c9abc32c7f/llm/llm.go#L83). So GPU-based prompt processing with (pure) CPU-based inference is not possible with ollama (only llama.cpp). Yes, one could set num_gpu 1, but partial offloading also slightly degrades inference speed (tokens/s) when the GPU is busy with other processes; this effect can easily be revealed with llama.cpp's benchmark:

./llama-bench -m <your preferred model.gguf> -p 0 -ngl 0,1

In my case with a busy GPU and Mixtral 8x7B:

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 0 | tg 128 | 4.16 ± 0.01 |
| llama 7B Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 1 | tg 128 | 3.97 ± 0.04 |

This is a minor performance loss of ~5%. For smaller models like 7B, it is only 0.5%, likely because the offloaded layer is much smaller. So I presume this effect worsens with increasing model/layer size.

@jukofyork I would like to revive this issue. It would be intuitive and beneficial if ollama behaved like llama.cpp when a GPU is detected and num_gpu is 0: use the GPU, but don't offload any layers (this will still use some VRAM, presumably due to KV cache offloading).


PS: Interestingly, the prompt processing is not slowed down with increasing layer offloading:

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 0 | tg 128 | 84.04 ± 0.32 |
| llama 7B Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 1 | tg 128 | 86.25 ± 0.31 |
| ... | ... | ... | ... | ... | ... | ... |
| llama 7B Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 6 | tg 128 | 97.06 ± 0.27 |
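
A sweep along these lines should reproduce the table above (model path is a placeholder; -p 0 skips the prompt-processing test and -n 128 matches the tg 128 runs):

```bash
# Hypothetical sweep: measure generation speed while offloading 0..6 layers.
./llama-bench -m ./models/mixtral-8x7b.Q4_K_M.gguf -p 0 -n 128 -ngl 0,1,2,3,4,5,6
```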
Reference: github-starred/ollama#62994