[GH-ISSUE #1812] IMPROVEMENT: Proper calculation of the KV cache size inside of gpu::NumGPU() instead of the 3/4 magic number... #26794

Closed
opened 2026-04-22 03:23:54 -05:00 by GiteaMirror · 3 comments

Originally created by @jukofyork on GitHub (Jan 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1812

Originally assigned to: @BruceMacD on GitHub.

See: https://github.com/jmorganca/ollama/issues/1800#issuecomment-1878955910

Feel free to pull out the stuff from that thread - it's only in there as I did quite a lot of research on this to try to figure out the OOM errors.

GiteaMirror added the feature request label 2026-04-22 03:23:54 -05:00

@jukofyork commented on GitHub (Jan 5, 2024):

Can a mod pull the discussion out of the other thread about the KV cache size into here?


Anyway, it seems that llama.cpp arbitrarily uses a 512 MB scratch buffer for the cuBLAS calculation:

```
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
```

I've also just confirmed this empirically with the following test:

So in the other thread I showed how to calculate that `deepseek-coder:6.7b-instruct` needs exactly 4096MB of KV cache for a 16k context. Then subtracting off the 512MB scratch buffer:

```
// 75% of the absolute max number of layers we can fit in available VRAM, off-loading too many layers to the GPU can cause OOM errors
//layers := int(info.FreeMemory/bytesPerLayer) * 3 / 4
layers := int((info.FreeMemory-4294967296-536870912)/bytesPerLayer)
```

From `nvidia-smi` this is using: 24036MiB / 24564MiB.

(With the difference likely being due to rounding down the number of layers)

If I subtract 1024MB from the above instead, I was left with 520MB of free VRAM, so it does indeed look like llama.cpp is using exactly 512 MB of VRAM for the cuBLAS prompt evaluation, and it's unrelated to batch_size (so long as n_batch >= 32).
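As a rough illustration, here's a minimal Go sketch of what an explicit calculation along these lines could look like instead of the 3/4 magic number. The function names, parameters, and per-layer size are hypothetical (this isn't Ollama's actual API); it assumes an f16 cache sized the way llama.cpp sizes it (one K and one V tensor of n_ctx x n_embd_gqa elements per layer), which is at least consistent with the figures above doubling from ~4GB at a 16k context to ~8GB at 32k:

```
package main

import "fmt"

// estimateKVCacheBytes sizes an f16 KV cache: one K and one V tensor per
// layer, each holding nCtx*nEmbdGQA elements at 2 bytes per element.
func estimateKVCacheBytes(nLayer, nCtx, nEmbdGQA int64) int64 {
	const bytesPerElem = 2 // f16
	return 2 /* K and V */ * nLayer * nCtx * nEmbdGQA * bytesPerElem
}

// layersToOffload replaces the 3/4 heuristic with explicit reservations for
// the KV cache and the cuBLAS scratch buffer.
func layersToOffload(freeVRAM, bytesPerLayer, kvCacheBytes, scratchBytes int64) int64 {
	usable := freeVRAM - kvCacheBytes - scratchBytes
	if usable < 0 {
		return 0
	}
	return usable / bytesPerLayer
}

func main() {
	// Hypothetical model dimensions, chosen only so the result reproduces the
	// ~4GiB KV-cache figure quoted above for a 16k context.
	kv := estimateKVCacheBytes(32, 16384, 2048)

	// Illustrative numbers: a 24GiB card, the 512MiB scratch buffer, and a
	// made-up 200MiB per layer.
	free := int64(24) << 30
	scratch := int64(512) << 20
	perLayer := int64(200) << 20

	fmt.Println("kv cache bytes:", kv)
	fmt.Println("layers to offload:", layersToOffload(free, perLayer, kv, scratch))
}
```

In practice nLayer and nEmbdGQA would have to come from the model's GGUF metadata, and the scratch-buffer reservation is exactly the number this thread is still trying to pin down.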

But on the other hand, if I try to use `-8589934592-536870912` and run `deepseek-coder:6.7b-instruct` with a 32k context, the Ollama CLI exits with `Error: Post "http://127.0.0.1:11434/api/generate": EOF` as though it has hit an OOM, so possibly this needs looking at more carefully (it could be because I'm also pushing up against the 64GB of system RAM or something too...).


**EDIT:** Actually, I've just seen it says `allocating batch_size x 1 MB` and I was using a batch size of 64, so the above obviously isn't correct...


@jukofyork commented on GitHub (Jan 6, 2024):

Well, I've tried looking through the current llama.cpp code to see if I can find exactly where this is getting calculated.

It looks like the code up until around the middle of 2023 was a lot clearer in general, but a lot of the recent changes have just created endless chains of function calls and it's not clear at all how it's creating the scratch buffer anymore.

I do worry that some of the weird VRAM leaks will never be tracked down, as the code is verging on impenetrable now :(

As it is, I think any attempt to improve on the 3/4 magic number is just as likely to cause problems as to fix them...


@valentimarco commented on GitHub (Jan 6, 2024):

Maybe the error below is caused by a memory leak?

```
cuBLAS error 15 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:8458
current device: 0
GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:8458: !"cuBLAS error"
```

I tried to understand this assert, but I only know very basic CUDA C.
