[GH-ISSUE #1800] OOM errors for large context models can be solved by reducing 'num_batch' down from the default of 512 #63064

Closed
opened 2026-05-03 11:38:24 -05:00 by GiteaMirror · 11 comments

Originally created by @jukofyork on GitHub (Jan 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1800

Originally assigned to: @BruceMacD on GitHub.

I thought I'd post this here in case it helps others suffering from OOM errors, as I searched and could find no mention of either "num_batch" or "n_batch" anywhere here.

I've been having endless problems with OOM errors when I try to run models with a context length of 16k like "deepseek-coder:33b-instruct" and originally thought it was due to this:

```go
// 75% of the absolute max number of layers we can fit in available VRAM, off-loading too many layers to the GPU can cause OOM errors
layers := int(info.FreeMemory/bytesPerLayer) * 3 / 4
```

But whatever I set that to (even tiny fractions like 1/100), I would still eventually get an OOM error after inputting a lot of data to the 16k models... I could actually see the VRAM use go up in nvidia-smi on Linux until it hit the 24GB of my 4090 and then crash.

So next I tried "num_gpu=0" and this did work (I still got the benefit of cuBLAS for the prompt evaluation, but otherwise very slow generation...). As soon as I set this to even "num_gpu=1" I would get an OOM error after inputting a lot of data (but still way less than 16k tokens) to the 16k models.

So I then went into the Ollama source and found there are some hidden "PARAMETER" settings not mentioned in "/docs/modelfile.md" that can be found in "api/types.go", and one of these is "num_batch" (which corresponds to "n_batch" in llama.cpp). It turns out this was the solution. The default value is 512 (inherited from llama.cpp) and I found that reducing it finally solved the OOM crash problem.

It looks like there may even be a relationship where it needs to be decreased by a factor of num_ctx/4096 (= 4 for the 16k context models), and this in turn could possibly have something to do with the 3 / 4 magic number in the code above and/or the fact that 4096 is a very common default context size? Anyway, setting it to 128 almost worked, except when I deliberately fed in a file I created that I know deepseek-coder:33b-instruct will tokenize into 16216 tokens... So I then reduced it to 64 and have since fed this same file in 4-5 times using the chat completion API, so the complete conversation is > 64k tokens, and it still hasn't crashed yet (the poor thing had a meltdown after 64k tokens and just replied "I'm sorry, but I can't assist with that" though lol).

I suspect I could get even closer to 128 as it did almost work, but for now I'm just leaving it at 64 to see how I get on...

It should be noted that num_batch has to be >= 32 (as per the llama.cpp docs), otherwise it won't use the cuBLAS kernels for prompt evaluation at all.

I suggest anybody suffering from similar OOM errors add this to their modelfiles, starting at 32:

PARAMETER num_batch 32

and keep doubling it until you get the OOM errors again.
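For reference, a minimal Modelfile sketch putting the suggestion together (the model name and num_ctx value are just the ones from this report, not a recommendation; adjust both to your own setup):

```
# Hypothetical Modelfile for a 16k-context variant with a conservative batch size
FROM deepseek-coder:33b-instruct
PARAMETER num_ctx 16384
PARAMETER num_batch 32
```

Build it with something like `ollama create deepseek-coder-16k -f Modelfile` and raise num_batch from there.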


@jukofyork commented on GitHub (Jan 5, 2024):

Just a quick update on other models that have different architectures.

Again I'm using my test file of ~16k tokens and a setting of num_batch=64, on Debian 12 with 64GB RAM + a 4090 with 24GB VRAM:

  • codellama:34b-instruct with 16k context - passed.
  • yi:34b-chat with 16k context - passed.
  • mixtral:8x7b-instruct-v0.1 with 32k context and was fed the file 2x - passed.

I will try deepseek-llm:67b-chat with its context extended to 16k tomorrow and report back. I don't have any other base models I can test on, but I'm pretty sure I've solved my OOM problems now. nvidia-smi is showing around 21-23GB used of the 24GB at all times and it seems that I can now repeatedly fill the context until my LLMs have a meltdown 🤣


@mongolu commented on GitHub (Jan 5, 2024):

> [Quotes the original post above in full.]

Niceee!
10x, it resolved my problem (I was bumping into this too, often).
I use 64 for num_batch now.


@jukofyork commented on GitHub (Jan 5, 2024):

> Niceee! 10x, it resolved my problem (I was bumping into this too, often). I use 64 for num_batch now.

Can you run a test and see if leaving it as 512 and setting num_gpu=1 still crashes for you?

I'm beginning to suspect this is a problem with the wrapped llama.cpp server rather than Ollama itself...

If anybody else is getting these crashes and reducing the batch size fixes it, can you also run a test with num_gpu=1 and see if it still crashes with the default batch size of 512? I'll make a detailed post on their GitHub if we can narrow it down a bit more.
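If it helps anyone run that test quickly, the same parameters can also be passed per-request through the REST API's options field rather than building a new model. A rough sketch (model name and prompt are placeholders):

```
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder:33b-instruct",
  "prompt": "<paste the large test prompt here>",
  "options": { "num_ctx": 16384, "num_batch": 512, "num_gpu": 1 }
}'
```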

I've got to go out but I think we can also refine the * 3 / 4 magic number and possibly use more of the GPU now: somewhere I have bookmarked the formula used to calculate the KV working memory (and I tested to make sure it agrees with llama.cpp main's output). In theory we should be able to use this instead of the magic number, but to do so will require exposing some more of the fields read from the GGUF file to Gpu.go to calculate it. I'm also not sure just how much, or if any, of the GPU VRAM is used for the cuBLAS batching and need to benchmark it.


@jukofyork commented on GitHub (Jan 5, 2024):

> I've got to go out but I think we can also refine the * 3 / 4 magic number and possibly use more of the GPU now: somewhere I have bookmarked the formula used to calculate the KV working memory (and I tested to make sure it agrees with llama.cpp main's output). In theory we should be able to use this instead of the magic number, but to do so will require exposing some more of the fields read from the GGUF file to Gpu.go to calculate it. I'm also not sure just how much, or if any, of the GPU VRAM is used for the cuBLAS batching and need to benchmark it.

I can confirm this page has the correct formula for calculating the KV cache:

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

```
KV cache size = batch_size * seqlen * (d_model/n_heads) * n_layers * 2 * 2 * n_kv_heads
```

I did this calculation by hand for (IIRC) Llama-70b and a context length of 2048:

batch_size = 1 (NOTE: this is a different batch size to what we have here and is about serving multiple users from an A100)
d_model = 8192
n_heads = 64
n_layers = 80
n_kv_heads = 8

(1 * 2048 * (8192/64) * 80 * 2 * 2 * 8) / 1024^2 = 640MB

and this is exactly the same as what llama.cpp::main prints towards the bottom of its output when run.
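As a sanity check, here is a small self-contained sketch of that formula in Go. The function name and signature are made up for illustration (this is not Ollama's code), and it assumes an fp16 K/V cache, i.e. 2 bytes per element:

```go
package main

import "fmt"

// kvCacheBytes is a sketch of the formula quoted above:
// batch_size * seqlen * (d_model/n_heads) * n_layers * 2 * 2 * n_kv_heads.
// The first 2 covers the K and V tensors, the second 2 is bytes per fp16 element.
func kvCacheBytes(batchSize, seqLen, dModel, nHeads, nLayers, nKVHeads int64) int64 {
	return batchSize * seqLen * (dModel / nHeads) * nLayers * 2 * 2 * nKVHeads
}

func main() {
	// Llama-2-70B at a 2048-token context, as in the hand calculation above.
	kv := kvCacheBytes(1, 2048, 8192, 64, 80, 8)
	fmt.Printf("%d MB\n", kv/(1<<20)) // prints 640 MB
}
```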

There are several wrong formulas floating about too:

https://old.reddit.com/r/LocalLLaMA/comments/1848puo/relationship_of_ram_to_context_size/
https://old.reddit.com/r/LocalLLaMA/comments/15825bt/how_much_ram_is_needed_for_llama2_70b_32k_context/
https://www.baseten.co/blog/llm-transformer-inference-guide/

Currently the function at the bottom of Gpu.go only gets passed the size of the model and the n_layers value, but I assume it wouldn't be hard to change it to pass the other values from the GGUF file's header to it and do the proper calculation? IIRC, when I looked at the output of llama.cpp::main some things like d_model were named differently to the formula above though.


This is from the new wizardcoder:33b-v1.1 model, which is a fine-tune of deepseek-coder:33b-instruct and which I happened to have the GGUF file handy for:

```
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 32256
llm_load_print_meta: n_merges         = 31757
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_head           = 56
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 62
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 19200
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 100000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.34 B
llm_load_print_meta: model size       = 21.92 GiB (5.65 BPW)
llm_load_print_meta: general.name     = wizardlm_wizardcoder-33b-v1.1
llm_load_print_meta: BOS token        = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: UNK token        = 32022 '<unk>'
llm_load_print_meta: PAD token        = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: LF token         = 126 'Ä'
llm_load_tensors: ggml ctx size =    0.21 MiB
llm_load_tensors: mem required  = 22444.54 MiB
```

d_model <--> n_embd = 7168 (which I think is also: n_head * n_rot)
n_heads <--> n_head = 56
n_layers <--> n_layer = 64
n_kv_heads <--> n_head_kv = 8

So redoing the calculation for a 16k context size:

```
KV cache size = batch_size * seqlen * (d_model/n_heads) * n_layers * 2 * 2 * n_kv_heads
KV cache size = (1 * 16384 * (7168/56) * 64 * 2 * 2 * 8) / 1024^2 = 4096MB
```
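For what it's worth, the same figure falls out of the small Go sketch from the earlier comment when fed these values (still assuming an fp16 cache, and using n_layers = 64 as in the hand calculation above):

```go
package main

import "fmt"

func main() {
	// Same formula as the kvCacheBytes sketch earlier in the thread, with the
	// values read off the GGUF dump above (n_layers = 64 as in the hand calc).
	kv := int64(1) * 16384 * (7168 / 56) * 64 * 2 * 2 * 8
	fmt.Printf("%d MB\n", kv/(1<<20)) // prints 4096 MB
}
```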

Using the 3/4 magic number from Gpu.go on my 4090:

24GB VRAM = 24×1024 = 24576MB

24576 * 3/4 = 18432MB

18432 + 4096 = 22528MB

When running the model on my ~16k token file (with num_batch=64), nvidia-smi is showing the same use the whole time (ie: for both prompt evaluation and for generation):

21892MB / 24564MB

and this ties in with the above as the integer division in the gpu::NumGPU() function will be rounding down the number of layers.

I don't really know enough about cuBLAS to know if it needs any VRAM to run the prompt evaluation, but from this it doesn't look like it does (?).

These are the nvidia-smi stats for 8192 and 4096 context sizes for reference:

Using 8192 context size: 20480MB / 24564MB

Using 4096 context size: 20238MB / 24564MB

Which should have a KV cache size of 2048MB and 1024MB respectively, yet the Gpu.go function will just be allocating 3/4 of the 24GB for the offloaded layers and the extra VRAM must be getting used by cuBLAS (?).

So it's not 100% clear what's going on and it's probably worthwhile doing some benchmarks to see how to incorporate the KV cache size formula properly for those of us running with much smaller or much larger context sizes to utilize our VRAM as best as possible.

Anyway hope this is useful for somebody to work on refining the gpu::NumGPU() calculation eventually.


I just tested a 32k context model and, sure enough, it did crash with this error:

```
Error: Post "http://127.0.0.1:11434/api/generate": EOF
```

So quite clearly gpu::NumGPU() should be dynamically calculating the layers better and the 3/4 magic number is only working through luck most of the time (and possibly wasting VRAM for those running with < 4096 context too...).

So looking at the code to see how hard it would be to change:

llm::New() has access to ggml which contains the required variables. Then the chain goes:

```
llm::New() --> llm::newLlmServer() --> ext_server::newDefaultExtServer() --> ext_server::newExtServer() --> gpu::NumGPU()
```

gpu::NumGPU() would also need to be passed the context length, but in ext_server::newExtServer() it gets this a couple of lines down anyway:

```
numGPU := gpu.NumGPU(numLayers, fileInfo.Size(), opts)
.
.
.
sparams.n_ctx = C.uint(opts.NumCtx)
```

In gpu::NumGPU() you would need to use the formula above (possibly with some extra subtracted for cuBLAS as mentioned).

I'd do a pull request but I know nothing about Go and it would probably be a bodge-job considering how many different variables need passing up the chain... I think the best solution might be to calculate the KV cache size for a context length of 1, pass this up the chain to ext_server::newExtServer(), multiply it by the sparams.n_ctx value and then pass this as an extra parameter to gpu::NumGPU() to use. Hopefully somebody can try this, but if not I'll have a go, though I'd be much happier if somebody familiar with the codebase and Go did it.
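To make the idea concrete, here is a rough sketch of what a KV-cache-aware layer count could look like. All names here (numGPULayers, cublasReserve) are hypothetical, not Ollama's actual code, and the cuBLAS reserve is a guess that would need the benchmarking mentioned above:

```go
// Hedged sketch: subtract an estimated KV cache (and an assumed cuBLAS scratch
// reserve) from free VRAM before dividing by the per-layer size, instead of
// applying the fixed *3/4 fudge factor.
func numGPULayers(freeVRAM, bytesPerLayer, kvCachePerToken, numCtx int64) int64 {
	const cublasReserve int64 = 512 << 20 // assumed scratch space; needs measuring
	usable := freeVRAM - kvCachePerToken*numCtx - cublasReserve
	if usable <= 0 || bytesPerLayer <= 0 {
		return 0
	}
	return usable / bytesPerLayer
}
```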


@jukofyork commented on GitHub (Jan 5, 2024):

Back to the original problem... I've found a good way to find the optimal value of num_batch:

  • Set num_gpu manually to something fairly conservative so it's using around 1/2 to 3/4 of your GPU's VRAM.
  • Create a huge file with at least 2x more tokens than context and feed it in as a prompt using the Ollama command line.
  • Load up nvidia-smi and watch the VRAM usage.

The VRAM usage should go up rapidly at the start and then stabilize all the way through processing the huge file.

Write down the VRAM usage from nvidia-smi when it settles and then wait until it either crashes OOM or the prompt evaluation stage is over and it starts outputting text (likely to be gibberish or it might just end without saying anything, because you've overloaded the context...).

If you have set num_batch too high then the VRAM usage will have gone up by now (assuming it hasn't crashed OOM already).

Try to find 2 values where one works and the other doesn't and just keep bisecting them:

[64, 128] --> (64+128)/2 = 96 [BAD]

[64,96] --> (64+96)/2 = 80 [GOOD]

[80,96] --> (80+96)/2 = 88 ...

and so on.

Eventually you will find the sweet spot where you can't raise it anymore without VRAM starting to leak.

Then leave num_batch fixed at the good value and start raising num_gpu until you get OOM errors (this should happen as soon as the model loads now).

You should then have optimal num_batch and num_gpu settings for that particular model and any fine-tunes of it.
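In pseudo-Go, the bisection loop is just the following (testNumBatch is a hypothetical callback that loads the model with the given num_batch, feeds the oversized prompt, and reports whether VRAM stayed stable):

```go
// Sketch of the manual bisection described above: good is the largest known-safe
// num_batch, bad is the smallest known-failing one.
func tuneNumBatch(good, bad int, testNumBatch func(n int) bool) int {
	for bad-good > 1 {
		mid := (good + bad) / 2
		if testNumBatch(mid) {
			good = mid // VRAM stayed flat: mid is safe
		} else {
			bad = mid // VRAM crept up or OOM: mid is too high
		}
	}
	return good
}
```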

I've just done this with deepseek-coder:33b-instruct and got num_batch = 86 and num_gpu = 52:

I'm sorry for any confusion, but it appears you have posted multiple files with a single post. As per Stack Overflow guidelines, each file should be submitted separately.

However, here is your code combined into one file for easy reference:

🤣

It will be interesting to see if num_batch = 86 is constant for other base models like Llama 2 or Yi.


You might also want to kill the ollama process between each test, as it's not always clear whether it has actually reloaded the new value, and/or sometimes it seems to go into a CPU-only mode where it doesn't use cuBLAS at all (i.e. GPU use stays at 0% in nvidia-smi and the prompt evaluation stage takes an extremely long time).
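On a default Linux install that usually just means restarting the systemd service between runs (adjust if you start the server by hand):

```
sudo systemctl restart ollama
```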


@mongolu commented on GitHub (Jan 6, 2024):

> [Quotes @jukofyork's earlier comment in full: the request to test num_gpu=1 with the default batch size of 512, and the notes on refining the * 3 / 4 magic number.]

Before putting num_batch=64 in, I didn't have this parameter in the modelfile, but I had tried with num_gpu=1 and it still crashed.

Pretty impressive work you've done.
I'm sorry, I don't quite follow it all; maybe others more experienced will.
Right now I'm happy that it works without crashing, so far.


@jukofyork commented on GitHub (Jan 6, 2024):

I've managed to tune for the deepseek-coder, codellama and yi base models now, and the optimal values seem fairly random, ranging from 80 to 180 at a 16k context length.

It does seem that fine-tuned versions have almost the same optimal value as their base model, but not necessarily exactly the same, so I've chosen to round down to the previous multiple of 16 for safety.

I can run nearly anything with a context length of 4096 and the default batch size of 512, apart from Mixtral, which needs 256.

Mixtral still leaks memory and crashes with a 32k context length on the lowest allowable batch size of 32 if I give it a really massive file.

I'm going to retry with Q8 and Q6_K models later and see if they are any different to the current Q5_K_M models - there is some chance these use a different code path in llama.cpp and might avoid whatever is leaking VRAM.


@jukofyork commented on GitHub (Jan 6, 2024):

> [Quotes the exchange above, ending with @mongolu's report that num_gpu=1 with the default batch size still crashed.]

Yeah, I was having to use num_gpu=0 and had really slow generation (but still fast prompt evaluation from using cuBLAS). I'm getting a lot more usable generation now but the prompt evaluation is slower than it was...

Until this gets fixed I'm going to keep 2 copies of each model: a 4k context with the 512 batch size and a 16k context with the maximum non-OOM batch size, and choose between them based on the task (4k for small discussion prompts and 16k for large source-code ingestion prompts).
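As a sketch, the long-context copy is just a second Modelfile (the file and model names here are made up, and the num_batch value comes from the tuning above):

```
# Modelfile.16k -- the long-context copy; the 4k copy is the same model
# without these overrides
FROM deepseek-coder:33b-instruct
PARAMETER num_ctx 16384
PARAMETER num_batch 80
```

built with something like `ollama create deepseek-coder-16k -f Modelfile.16k`.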


@jukofyork commented on GitHub (Jan 6, 2024):

Update:

Tried deepseek-coder:33b-instruct-Q8_0 and same problem...


@jukofyork commented on GitHub (Jan 8, 2024):

Update: I've just moved to not using lower K-quant models if I want > 4k context. This buffer leak only seems to happen when increasing the context. I can still run 4k-context models fine using a mix of CPU and GPU.


@jmorganca commented on GitHub (Mar 12, 2024):

Hi folks if it's okay I'm going to merge this with the ongoing OOM + batch size issue: #1952
