[GH-ISSUE #12560] 🤔 Does num_batch do anything? If so what and how can we use it? #8332

Closed
opened 2026-04-12 20:54:14 -05:00 by GiteaMirror · 15 comments

Originally created by @FieldMouse-AI on GitHub (Oct 10, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12560

What is the issue?

🤔 Does num_batch do anything? If so, what, and how can we use it?

I only know that setting it to a value like 4096 makes the model so large that it no longer fits on my GPU at all and ends up 100% CPU bound.

I am trying small values, but I am not sure if they have any real effect.

So, what is the real story?
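For context, this is roughly how I am setting it while experimenting (a minimal sketch; the custom model name is illustrative, and the base model and parameter values match the setup described further down):

```shell
# Illustrative Modelfile setup (the custom model name is an example, not my real one)
cat > Modelfile <<'EOF'
FROM qwen3:4b-instruct-2507-q4_K_M
PARAMETER num_ctx 22000
PARAMETER num_batch 17
EOF

ollama create qwen3-batch17 -f Modelfile
ollama run qwen3-batch17
```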

😯😯 UPDATE: New data and identification of a possible regression

🧐🧐 It would appear that I am both correct and incorrect at the same time.

The allocation within the GPU appears to be as expected when viewed using nvtop:

  • PARAMETER num_batch 17: VRAM 6.145Gi/12.000Gi
  • PARAMETER num_batch not specified (i.e., the default): VRAM 8.116Gi/12.000Gi

So, the allocation is visibly different when viewed using nvtop.

However, under 0.12.5, ollama ps will report the SIZE as 8.5 GB in both cases.

But, under 0.12.3, ollama ps will report the SIZE as approximately 6.7 GB for PARAMETER num_batch 17, and 8.5 GB when PARAMETER num_batch is not specified (i.e., the default).
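For reference, the numbers above come from eyeballing nvtop while the model is loaded; a rough way to capture the same comparison non-interactively is something like the following, using nvidia-smi in place of nvtop:

```shell
# Snapshot Ollama's reported size and the driver's view of VRAM at the same moment
ollama ps
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```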

🤔🤔 New Concern is about ollama ps

ollama ps is not correctly reporting the VRAM allocation as of 0.12.5.

The last good version for reporting correct VRAM values is 0.12.3.

Any thoughts, please?

This really looks like a regression.

Relevant log output


OS

Linux, Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.12.3, 0.12.5

GiteaMirror added the bug label 2026-04-12 20:54:15 -05:00

@dragetd commented on GitHub (Oct 11, 2025):

Edit: I was confused and had the wrong parameter in mind - please ignore.


@FieldMouse-AI commented on GitHub (Oct 11, 2025):

> If you read online about it, it is the number of tokens per batch generated for streaming responses. Default is 1 I believe. Setting it to 4 or 8 reduces the overhead of network calls vs. new tokens, but causes less frequent updates. If you have a fast model (e.g. > 30 token/s) it makes sense to increase it a bit. If you have a very fast model (e.g. >100 token/s) you might even want to set it to 32 or something like this, but more is rarely useful.

@dragetd , thanks for the reply!

Could you do me a favor and reply to the following?

First: Where did you find this information? Could you supply links, please? Unfortunately, my searches yielded non-conclusive references within the Ollama and llama.cpp documentation.

Second: In my testing, with the num_batch parameter unset, my model would use about 70% of my GPU card's 12GB, while setting PARAMETER num_batch 17 reduced my model's allocation down to only 50% of my GPU card's 12GB.


@dragetd commented on GitHub (Oct 11, 2025):

@meidaid Okay, I actually confused the parameter myself and was thinking of the wrong one! I am very sorry, please ignore my post!


@FieldMouse-AI commented on GitHub (Oct 15, 2025):

It would appear that I am both correct and incorrect at the same time.

The allocation within the GPU appears to be as expected when viewed using nvtop:

  • PARAMETER num_batch 17: VRAM 6.145Gi/12.000Gi
  • PARAMETER num_batch not specified (i.e., the default): VRAM 8.116Gi/12.000Gi

So, the allocation is visibly different when viewed using nvtop.

However, under 0.12.5, ollama ps will report the SIZE as 8.5 GB in both cases.

But, under 0.12.3, ollama ps will report the SIZE as approximately 6.7 GB for PARAMETER num_batch 17, and 8.5 GB when PARAMETER num_batch is not specified (i.e., the default).

New Concern is about ollama ps

ollama ps is not correctly reporting the VRAM allocation as of 0.12.5.

The last good version for reporting correct VRAM values is 0.12.3.

Any thoughts, please?

This really looks like a regression.


@jessegross commented on GitHub (Oct 16, 2025):

@meidaid Please attach logs for the scenarios where you see differences. Everything looks like it is working fine to me. Be aware that the memory impact of batch size also depends on the model and context size, so ensure that you are holding those constant.


@FieldMouse-AI commented on GitHub (Oct 16, 2025):

> @meidaid Please attach logs for the scenarios where you see differences. Everything looks like it is working fine to me. Be aware that the memory impact of batch size also depends on the model and context size, so ensure that you are holding those constant.

@jessegross , thanks for your reply!

The good news is that I do keep the num_ctx 100% stable at PARAMETER num_ctx 22000.

I will work to gather the logs.


@FieldMouse-AI commented on GitHub (Oct 16, 2025):

> > @meidaid Please attach logs for the scenarios where you see differences. Everything looks like it is working fine to me. Be aware that the memory impact of batch size also depends on the model and context size, so ensure that you are holding those constant.
>
> @jessegross , thanks for your reply!
>
> The good news is that I do keep the num_ctx 100% stable at PARAMETER num_ctx 22000.
>
> I will work to gather the logs.

Hello, @jessegross , I've added the logs.

In both cases, the conditions are as follows:

  • Ollama v0.12.5
  • The model is based on qwen3:4b-instruct-2507-q4_K_M
  • PARAMETER num_ctx 22000

The test cases are:

  1. Test Case 1: PARAMETER num_batch 17 exists in the Modelfile.
  2. Test Case 2: PARAMETER num_batch does not exist in the Modelfile.

Attached logs are below:

num_batch-17.log
num_batch-unset.log


@jessegross commented on GitHub (Oct 16, 2025):

The difference is caused by the use of the old llama engine vs the new Ollama engine. Although both do account for batch size when calculating memory size, the Ollama engine is more accurate overall and should track nvidia-smi much more closely.

We are in the process of migrating models over to the Ollama engine incrementally and qwen3 was running on it in 0.12.3. However, it was temporarily moved back in 0.12.5 to fix a regression that was encountered. It will be on the Ollama engine again in 0.12.6.


@FieldMouse-AI commented on GitHub (Oct 16, 2025):

> The difference is caused by the use of the old llama engine vs the new Ollama engine. Although both do account for batch size when calculating memory size, the Ollama engine is more accurate overall and should track nvidia-smi much more closely.
>
> We are in the process of migrating models over to the Ollama engine incrementally and qwen3 was running on it in 0.12.3. However, it was temporarily moved back in 0.12.5 to fix a regression that was encountered. It will be on the Ollama engine again in 0.12.6.

Ah, thank you! So, in 0.12.6 ollama ps will match nvtop again, right?

If so, that would make reducing memory allocations and debugging things so much easier again!

I will install 0.12.6 as soon as it is available and retest!


@FieldMouse-AI commented on GitHub (Oct 17, 2025):

> > The difference is caused by the use of the old llama engine vs the new Ollama engine. Although both do account for batch size when calculating memory size, the Ollama engine is more accurate overall and should track nvidia-smi much more closely.
> > We are in the process of migrating models over to the Ollama engine incrementally and qwen3 was running on it in 0.12.3. However, it was temporarily moved back in 0.12.5 to fix a regression that was encountered. It will be on the Ollama engine again in 0.12.6.
>
> Ah, thank you! So, in 0.12.6 ollama ps will match nvtop again, right?
>
> If so, that would make reducing memory allocations and debugging things so much easier again!
>
> I will install 0.12.6 as soon as it is available and retest!

Thank you again, @jessegross !!!

Ollama 0.12.6 was just released and I ran tests on this new version of Ollama.

I am happy to report that the numbers from ollama ps are more in line with what is shown in nvtop.
(Please, keep reading... you will see where my little concern arises...)

While it is easier for me to make estimates about what models can fit into a single card, the numbers feel slightly off. This might be partly because nvtop reports GiB while ollama ps uses the more traditional decimal GB (6.730 Gi is roughly 7.2 GB), though that alone does not seem to account for the whole gap?

For example, the sum of my two models appears as follows:

  • ollama ps: 7.8GB
  • nvtop: 6.730Gi/12.000Gi

So, there is an improvement.
But I am curious: Why this difference in values? 🤔

PS: I went back and stopped the smaller embedding model...

I went back and stopped the smaller embedding model and found that the numbers for just the larger model looked more normal:

  • ollama ps: 6.1GB
  • nvtop: 6.012Gi/12.000Gi

So, perhaps my embedding model, bge-m3, might be doing something different with allocations??? 🤔


@FieldMouse-AI commented on GitHub (Oct 18, 2025):

Hello, @jessegross !

Thanks again for all of your help.

In testing 0.12.6, I've found it to be quite stable and the memory footprint is now well within my requirements for my workflow.

I just have one more question to ask, though: What exactly is num_batch used for and how does it really affect memory? I could not find a good explanation reading either the Ollama docs or the llama.cpp docs. Any references you might be able to share?


@FieldMouse-AI commented on GitHub (Oct 18, 2025):

> PS: I went back and stopped the smaller embedding model...
>
> I went back and stopped the smaller embedding model and found that the numbers for just the larger model looked more normal:
>
>   • ollama ps: 6.1GB
>   • nvtop: 6.012Gi/12.000Gi
>
> So, perhaps my embedding model, bge-m3, might be doing something different with allocations??? 🤔

Good news, @jessegross !

🤗 I did some testing with num_batch and my embedding model through the ollama.embed() API (https://github.com/ollama/ollama/blob/main/docs/api.md#generate-embeddings) and found that when I reduced it to num_batch: 2, according to nvtop, its memory footprint went from 1.7GB down to just about 0.7GB!
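In case it helps to reproduce, the request looks roughly like this (model tag and input text are placeholders for my actual setup; the options field is the same mechanism described in the linked API docs):

```shell
# Roughly the embeddings request I am making (values are illustrative)
curl http://localhost:11434/api/embed -d '{
  "model": "bge-m3",
  "input": "some text to embed",
  "options": { "num_batch": 2 }
}'
```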

😭 Unfortunately, similar to the problem that we saw with regular models, the embedding model's reduced memory footprint was not reflected in ollama ps, which still reported the allocation as 1.7GB when it should have been 0.7GB.

🤔 Does this look like it would be a similar fix for the ollama.embed() API that was done for the ollama.chat() / ollama.generate() APIs?

Thanks in advance!


@jessegross commented on GitHub (Oct 20, 2025):

It's a similar issue as before. bge-m3 runs on the old engine where memory allocation is not as good. Embedding models that run on the Ollama engine with better memory management include qwen3-embedding and embeddinggemma.

Regarding the original question, batch size is mostly a tradeoff between speed and memory usage - larger batch sizes will process more tokens at once, which is faster but requires more memory. However, there are situations which require all tokens to be present in the same batch, including embedding and image models. In general, I would recommend leaving it at the default.
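To make that concrete, the batch size can also be overridden per request via the options field rather than baked into the Modelfile; something along these lines (the value 64 is just for illustration):

```shell
# Example of overriding the batch size for a single request (64 is an arbitrary value)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:4b-instruct-2507-q4_K_M",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": { "num_batch": 64 }
}'
```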


@FieldMouse-AI commented on GitHub (Oct 21, 2025):

> It's a similar issue as before. bge-m3 runs on the old engine where memory allocation is not as good. Embedding models that run on the Ollama engine with better memory management include qwen3-embedding and embeddinggemma.
>
> Regarding the original question, batch size is mostly a tradeoff between speed and memory usage - larger batch sizes will process more tokens at once, which is faster but requires more memory. However, there are situations which require all tokens to be present in the same batch, including embedding and image models. In general, I would recommend leaving it at the default.

Hello, @jessegross , I see what you mean now.

I am already using Qwen3 for other things, so I can try replacing bge-m3 with qwen3-embedding and fully retest my system.

Thanks for the insights!

Do you think we should close this issue now or should we wait until after I have some statistics from after I replace bge-m3 with qwen3-embedding?


@jessegross commented on GitHub (Oct 21, 2025):

I think we can close the issue - it should have a similar impact to the chat model running on the Ollama engine.
