[GH-ISSUE #1800] OOM errors for large context models can be solved by reducing 'num_batch' down from the default of 512 #63064

Closed
opened 2026-05-03 11:38:24 -05:00 by GiteaMirror · 11 comments

Originally created by @jukofyork on GitHub (Jan 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1800

Originally assigned to: @BruceMacD on GitHub.

I thought I'd post this here in case it helps others suffering from OOM errors, as I searched and could find no mention of either "num_batch" or "n_batch" anywhere here.

I've been having endless problems with OOM errors when I try to run models with a context length of 16k like "deepseek-coder:33b-instruct" and originally thought it was due to this:

```go
// 75% of the absolute max number of layers we can fit in available VRAM, off-loading too many layers to the GPU can cause OOM errors
layers := int(info.FreeMemory/bytesPerLayer) * 3 / 4
```

But whatever I set that to (even tiny fractions like 1/100), I would still eventually get an OOM error after inputting a lot of data to the 16k models... I could actually see the VRAM use go up in nvidia-smi on Linux until it hit the 24GB of my 4090 and then crash.

So next I tried "num_gpu=0" and this did work (I still got the benefit of cuBLAS for the prompt evaluation, but otherwise very slow generation...). As soon as I set this to even "num_gpu=1" I would get an OOM error after inputting a lot of data (but still way less than 16k tokens) to the 16k models.

So I then went into the Ollama source and found there are some hidden "PARAMETER" settings not mentioned in "/docs/modelfile.md" that can be found in "api/types.go", and one of these is "num_batch" (which corresponds to "n_batch" in llama.cpp). It turns out this was the solution. The default value is 512 (inherited from llama.cpp) and I found that reducing it finally solved the OOM crash problem.

It looks like there may even be a relationship where it needs to be decreased by a factor of num_ctx/4096 (= 4 for the 16k context models), and this in turn could possibly have something to do with the 3 / 4 magic number in the code above and/or the fact that 4096 is a very common default context size? Anyway, setting it to 128 almost worked, except when I deliberately fed in a file I created that I know deepseek-coder:33b-instruct will tokenize into 16216 tokens... So I then reduced it to 64 and have since fed this same file in 4-5 times using the chat completion API, so the complete conversation is > 64k tokens, and it still hasn't crashed yet (the poor thing had a meltdown after 64k tokens and just replied "I'm sorry, but I can't assist with that" though lol).

I suspect I could get even closer to 128 as it did almost work, but for now I'm just leaving it at 64 to see how I get on...

It should be noted that num_batch has to be >= 32 (as per the llama.cpp docs), otherwise it won't use the cuBLAS kernels for prompt evaluation at all.

I suggest anybody suffering from similar OOM errors add this to their modelfiles, starting at 32:

PARAMETER num_batch 32

and keep doubling it until you get the OOM errors again.
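For reference, a minimal Modelfile sketch putting the suggestion together (the model name and num_ctx value are just the ones from this report, not a recommendation; adjust both to your own setup):

```
# Hypothetical Modelfile for a 16k-context variant with a conservative batch size
FROM deepseek-coder:33b-instruct
PARAMETER num_ctx 16384
PARAMETER num_batch 32
```

Build it with something like `ollama create deepseek-coder-16k -f Modelfile` and raise num_batch from there.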


@jukofyork commented on GitHub (Jan 5, 2024):

Just a quick update on other models that have different architectures.

Again I'm using my test file of ~16k tokens and a setting of num_batch=64, on Debian 12 with 64GB RAM + a 4090 with 24GB VRAM:

  • codellama:34b-instruct with 16k context - passed.
  • yi:34b-chat with 16k context - passed.
  • mixtral:8x7b-instruct-v0.1 with 32k context and was fed the file 2x - passed.

I will try deepseek-llm:67b-chat with its context extended to 16k tomorrow and report back. I don't have any other base models I can test on, but I'm pretty sure I've solved my OOM problems now. nvidia-smi is showing around 21-23GB used of the 24GB at all times and it seems that I can now repeatedly fill the context until my LLMs have a meltdown 🤣


@mongolu commented on GitHub (Jan 5, 2024):

> [Quotes the original post above in full.]

Niceee!
10x, it resolved my problem (I was bumping into this too, often).
I use 64 for num_batch now.


@jukofyork commented on GitHub (Jan 5, 2024):

> Niceee! 10x, it resolved my problem (I was bumping into this too, often). I use 64 for num_batch now.

Can you run a test and see if leaving it as 512 and setting num_gpu=1 still crashes for you?

I'm beginning to suspect this is a problem with the wrapped llama.cpp server rather than Ollama itself...

If anybody else is getting these crashes and reducing the batch size fixes it, can you also run a test with num_gpu=1 and see if it still crashes with the default batch size of 512? I'll make a detailed post on their GitHub if we can narrow it down a bit more.
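If it helps anyone run that test quickly, the same parameters can also be passed per-request through the REST API's options field rather than building a new model. A rough sketch (model name and prompt are placeholders):

```
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder:33b-instruct",
  "prompt": "<paste the large test prompt here>",
  "options": { "num_ctx": 16384, "num_batch": 512, "num_gpu": 1 }
}'
```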

I've got to go out but I think we can also refine the * 3 / 4 magic number and possibly use more of the GPU now: somewhere I have bookmarked the formula used to calculate the KV working memory (and I tested to make sure it agrees with llama.cpp main's output). In theory we should be able to use this instead of the magic number, but to do so will require exposing some more of the fields read from the GGUF file to Gpu.go to calculate it. I'm also not sure just how much, or if any, of the GPU VRAM is used for the cuBLAS batching and need to benchmark it.


@jukofyork commented on GitHub (Jan 5, 2024):

> I've got to go out but I think we can also refine the * 3 / 4 magic number and possibly use more of the GPU now: somewhere I have bookmarked the formula used to calculate the KV working memory (and I tested to make sure it agrees with llama.cpp main's output). In theory we should be able to use this instead of the magic number, but to do so will require exposing some more of the fields read from the GGUF file to Gpu.go to calculate it. I'm also not sure just how much, or if any, of the GPU VRAM is used for the cuBLAS batching and need to benchmark it.

I can confirm this page has the correct formula for calculating the KV cache:

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

```
KV cache size = batch_size * seqlen * (d_model/n_heads) * n_layers * 2 * 2 * n_kv_heads
```

I did this calculation by hand for (IIRC) Llama-70b and a context length of 2048:

batch_size = 1 (NOTE: this is a different batch size to what we have here and is about serving multiple users from an A100)
d_model = 8192
n_heads = 64
n_layers = 80
n_kv_heads = 8

(1 * 2048 * (8192/64) * 80 * 2 * 2 * 8) / 1024^2 = 640MB

and this is exactly the same as what llama.cpp::main prints towards the bottom of its output when run.
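As a sanity check, here is a small self-contained sketch of that formula in Go. The function name and signature are made up for illustration (this is not Ollama's code), and it assumes an fp16 K/V cache, i.e. 2 bytes per element:

```go
package main

import "fmt"

// kvCacheBytes is a sketch of the formula quoted above:
// batch_size * seqlen * (d_model/n_heads) * n_layers * 2 * 2 * n_kv_heads.
// The first 2 covers the K and V tensors, the second 2 is bytes per fp16 element.
func kvCacheBytes(batchSize, seqLen, dModel, nHeads, nLayers, nKVHeads int64) int64 {
	return batchSize * seqLen * (dModel / nHeads) * nLayers * 2 * 2 * nKVHeads
}

func main() {
	// Llama-2-70B at a 2048-token context, as in the hand calculation above.
	kv := kvCacheBytes(1, 2048, 8192, 64, 80, 8)
	fmt.Printf("%d MB\n", kv/(1<<20)) // prints 640 MB
}
```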

There are several wrong formulas floating about too:

https://old.reddit.com/r/LocalLLaMA/comments/1848puo/relationship_of_ram_to_context_size/
https://old.reddit.com/r/LocalLLaMA/comments/15825bt/how_much_ram_is_needed_for_llama2_70b_32k_context/
https://www.baseten.co/blog/llm-transformer-inference-guide/

Currently the function at the bottom of Gpu.go only gets passed the size of the model and the n_layers value, but I assume it wouldn't be hard to change it to pass the other values from the GGUF file's header to it and do the proper calculation? IIRC, when I looked at the output of llama.cpp::main some things like d_model were named differently to the formula above though.


This is from the new wizardcoder:33b-v1.1 model, which is a fine-tune of deepseek-coder:33b-instruct and which I happened to have the GGUF file handy for:

```
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 32256
llm_load_print_meta: n_merges         = 31757
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_head           = 56
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 62
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 19200
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 100000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.34 B
llm_load_print_meta: model size       = 21.92 GiB (5.65 BPW)
llm_load_print_meta: general.name     = wizardlm_wizardcoder-33b-v1.1
llm_load_print_meta: BOS token        = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: UNK token        = 32022 '<unk>'
llm_load_print_meta: PAD token        = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: LF token         = 126 'Ä'
llm_load_tensors: ggml ctx size =    0.21 MiB
llm_load_tensors: mem required  = 22444.54 MiB
```

d_model <--> n_embd = 7168 (which I think is also: n_head * n_rot)
n_heads <--> n_head = 56
n_layers <--> n_layer = 64
n_kv_heads <--> n_head_kv = 8

So redoing the calculation for a 16k context size:

```
KV cache size = batch_size * seqlen * (d_model/n_heads) * n_layers * 2 * 2 * n_kv_heads
KV cache size = (1 * 16384 * (7168/56) * 64 * 2 * 2 * 8) / 1024^2 = 4096MB
```
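For what it's worth, the same figure falls out of the small Go sketch from the earlier comment when fed these values (still assuming an fp16 cache, and using n_layers = 64 as in the hand calculation above):

```go
package main

import "fmt"

func main() {
	// Same formula as the kvCacheBytes sketch earlier in the thread, with the
	// values read off the GGUF dump above (n_layers = 64 as in the hand calc).
	kv := int64(1) * 16384 * (7168 / 56) * 64 * 2 * 2 * 8
	fmt.Printf("%d MB\n", kv/(1<<20)) // prints 4096 MB
}
```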

Using the 3/4 magic number from Gpu.go on my 4090:

24GB VRAM = 24×1024 = 24576MB

24576 * 3/4 = 18432MB

18432 + 4096 = 22528MB

When running the model on my ~16k token file (with num_batch=64), nvidia-smi is showing the same use the whole time (ie: for both prompt evaluation and for generation):

21892MB / 24564MB

and this ties in with the above as the integer division in the gpu::NumGPU() function will be rounding down the number of layers.

I don't really know enough about cuBLAS to know if it needs any VRAM to run the prompt evaluation, but from this it doesn't look like it does (?).

These are the nvidia-smi stats for 8192 and 4096 context sizes for reference:

Using 8192 context size: 20480MB / 24564MB

Using 4096 context size: 20238MB / 24564MB

Which should have a KV cache size of 2048MB and 1024MB respectively, yet the Gpu.go function will just be allocating 3/4 of the 24GB for the offloaded layers and the extra VRAM must be getting used by cuBLAS (?).

So it's not 100% clear what's going on and it's probably worthwhile doing some benchmarks to see how to incorporate the KV cache size formula properly for those of us running with much smaller or much larger context sizes to utilize our VRAM as best as possible.

Anyway hope this is useful for somebody to work on refining the gpu::NumGPU() calculation eventually.


I just tested a 32k context model and, sure enough, it did crash with this error:

```
Error: Post "http://127.0.0.1:11434/api/generate": EOF
```

So quite clearly gpu::NumGPU() should be dynamically calculating the layers better and the 3/4 magic number is only working through luck most of the time (and possibly wasting VRAM for those running with < 4096 context too...).

So looking at the code to see how hard it would be to change:

llm::New() has access to ggml which contains the required variables. Then the chain goes:

```
llm::New() --> llm::newLlmServer() --> ext_server::newDefaultExtServer() --> ext_server::newExtServer() --> gpu::NumGPU()
```

gpu::NumGPU() would also need to be passed the context length, but in ext_server::newExtServer() it gets this a couple of lines down anyway:

```
numGPU := gpu.NumGPU(numLayers, fileInfo.Size(), opts)
.
.
.
sparams.n_ctx = C.uint(opts.NumCtx)
```

In gpu::NumGPU() you would need to use the formula above (possibly with some extra subtracted for cuBLAS as mentioned).

I'd do a pull request but I know nothing about Go and it would probably be a bodge-job considering how many different variables need passing up the chain... I think the best solution might be to calculate the KV cache size for a context length of 1, pass this up the chain to ext_server::newExtServer(), multiply it by the sparams.n_ctx value and then pass this as an extra parameter to gpu::NumGPU() to use. Hopefully somebody can try this, but if not I'll have a go, though I'd be much happier if somebody familiar with the codebase and Go did it.
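To make the idea concrete, here is a rough sketch of what a KV-cache-aware layer count could look like. All names here (numGPULayers, cublasReserve) are hypothetical, not Ollama's actual code, and the cuBLAS reserve is a guess that would need the benchmarking mentioned above:

```go
// Hedged sketch: subtract an estimated KV cache (and an assumed cuBLAS scratch
// reserve) from free VRAM before dividing by the per-layer size, instead of
// applying the fixed *3/4 fudge factor.
func numGPULayers(freeVRAM, bytesPerLayer, kvCachePerToken, numCtx int64) int64 {
	const cublasReserve int64 = 512 << 20 // assumed scratch space; needs measuring
	usable := freeVRAM - kvCachePerToken*numCtx - cublasReserve
	if usable <= 0 || bytesPerLayer <= 0 {
		return 0
	}
	return usable / bytesPerLayer
}
```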


@jukofyork commented on GitHub (Jan 5, 2024):

Back to the original problem... I've found a good way to find the optimal value of num_batch:

  • Set num_gpu manually to something fairly conservative so it's using around 1/2 to 3/4 of your GPU's VRAM.
  • Create a huge file with at least 2x more tokens than context and feed it in as a prompt using the Ollama command line.
  • Load up nvidia-smi and watch the VRAM usage.

The VRAM usage should go up rapidly at the start and then stabilize all the way through processing the huge file.

Write down the VRAM usage from nvidia-smi when it settles and then wait until it either crashes OOM or the prompt evaluation stage is over and it starts outputting text (likely to be gibberish or it might just end without saying anything, because you've overloaded the context...).

If you have set num_batch too high then the VRAM usage will have gone up by now (assuming it hasn't crashed OOM already).

Try to find 2 values where one works and the other doesn't and just keep bisecting them:

[64, 128] --> (64+128)/2 = 96 [BAD]

[64,96] --> (64+96)/2 = 80 [GOOD]

[80,96] --> (80+96)/2 = 88 ...

and so on.

Eventually you will find the sweet spot where you can't raise it anymore without VRAM starting to leak.

Then leave num_batch fixed at the good value and start raising num_gpu until you get OOM errors (this should happen as soon as the model loads now).

You should then have optimal num_batch and num_gpu settings for that particular model and any fine-tunes of it.
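In pseudo-Go, the bisection loop is just the following (testNumBatch is a hypothetical callback that loads the model with the given num_batch, feeds the oversized prompt, and reports whether VRAM stayed stable):

```go
// Sketch of the manual bisection described above: good is the largest known-safe
// num_batch, bad is the smallest known-failing one.
func tuneNumBatch(good, bad int, testNumBatch func(n int) bool) int {
	for bad-good > 1 {
		mid := (good + bad) / 2
		if testNumBatch(mid) {
			good = mid // VRAM stayed flat: mid is safe
		} else {
			bad = mid // VRAM crept up or OOM: mid is too high
		}
	}
	return good
}
```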

I've just done this with deepseek-coder:33b-instruct and got num_batch = 86 and num_gpu = 52:

I'm sorry for any confusion, but it appears you have posted multiple files with a single post. As per Stack Overflow guidelines, each file should be submitted separately.

However, here is your code combined into one file for easy reference:

🤣

It will be interesting to see if num_batch = 86 is constant for other base models like Llama 2 or Yi.


You might also want to kill the ollama process between each test, as it's not always clear whether it has actually reloaded the new value, and/or sometimes it seems to go into a CPU-only mode where it doesn't use cuBLAS at all (i.e. GPU use stays at 0% in nvidia-smi and the prompt evaluation stage takes an extremely long time).
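On a default Linux install that usually just means restarting the systemd service between runs (adjust if you start the server by hand):

```
sudo systemctl restart ollama
```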


@mongolu commented on GitHub (Jan 6, 2024):

> [Quotes @jukofyork's earlier comment in full: the request to test num_gpu=1 with the default batch size of 512, and the notes on refining the * 3 / 4 magic number.]

Before putting num_batch=64 in, I didn't have this parameter in the modelfile, but I had tried with num_gpu=1 and it still crashed.

Pretty impressive work you've done.
I'm sorry, I don't quite follow it all; maybe others more experienced will.
Right now I'm happy that it works without crashing, so far.


@jukofyork commented on GitHub (Jan 6, 2024):

I've managed to tune for the deepseek-coder, codellama and yi base models now, and the optimal values seem fairly random, ranging from 80 to 180 at a 16k context length.

It does seem that fine-tuned versions have almost the same optimal value as their base model, but not necessarily exactly the same, so I've chosen to round down to the previous multiple of 16 for safety.

I can run nearly anything with a context length of 4096 and the default batch size of 512, apart from Mixtral, which needs 256.

Mixtral still leaks memory and crashes with a 32k context length on the lowest allowable batch size of 32 if I give it a really massive file.

I'm going to retry with Q8 and Q6_K models later and see if they are any different to the current Q5_K_M models - there is some chance these use a different code path in llama.cpp and might avoid whatever is leaking VRAM.


@jukofyork commented on GitHub (Jan 6, 2024):

> [Quotes the exchange above, ending with @mongolu's report that num_gpu=1 with the default batch size still crashed.]

Yeah, I was having to use num_gpu=0 and had really slow generation (but still fast prompt evaluation from using cuBLAS). I'm getting a lot more usable generation now but the prompt evaluation is slower than it was...

Until this gets fixed I'm going to keep 2 copies of each model: a 4k context with the 512 batch size and a 16k context with the maximum non-OOM batch size, and choose between them based on the task (4k for small discussion prompts and 16k for large source-code ingestion prompts).
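As a sketch, the long-context copy is just a second Modelfile (the file and model names here are made up, and the num_batch value comes from the tuning above):

```
# Modelfile.16k -- the long-context copy; the 4k copy is the same model
# without these overrides
FROM deepseek-coder:33b-instruct
PARAMETER num_ctx 16384
PARAMETER num_batch 80
```

built with something like `ollama create deepseek-coder-16k -f Modelfile.16k`.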


@jukofyork commented on GitHub (Jan 6, 2024):

Update:

Tried deepseek-coder:33b-instruct-Q8_0 and same problem...


@jukofyork commented on GitHub (Jan 8, 2024):

Update: I've just moved to not using lower K-quant models if I want > 4k context. This buffer leak only seems to happen when increasing the context. I can still run 4k-context models fine using a mix of CPU and GPU.


@jmorganca commented on GitHub (Mar 12, 2024):

Hi folks if it's okay I'm going to merge this with the ongoing OOM + batch size issue: #1952
