[GH-ISSUE #4170] OLLAMA_NUM_PARALLEL cannot be the solution. #28352

Closed
opened 2026-04-22 06:28:34 -05:00 by GiteaMirror · 13 comments
Owner

Originally created by @KevinKrueger on GitHub (May 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4170

Originally assigned to: @dhiltgen on GitHub.

You shouldn't actually need this setting. Ollama should be able to figure out on its own how many parallel requests it can handle.

Ollama knows what hardware is installed and what performance it can deliver, and that information can be analyzed.

For example, if you send several requests and you already know in advance how much capacity each request will consume, the API should return an error message such as "No further requests can be processed by the model on the installed hardware." It could then offer some guidance on how to avoid this, or report how much capacity each additional request requires, so that a reasonably "fast" answer still comes back.
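
To make the request concrete, here is a minimal sketch of what that could look like from a client's point of view. The 503 branch is hypothetical and only illustrates the proposed "no further requests can be processed" response; the `/api/generate` endpoint and its `model`/`prompt`/`stream` fields are the existing API, and the model name is a placeholder.

```python
import json
import time
import urllib.error
import urllib.request

def generate(prompt: str, retries: int = 3) -> dict:
    """Call a local Ollama server; back off if it reports it is overloaded.

    NOTE: the 503 branch models the behavior *proposed* in this issue, not
    something the server is guaranteed to do today.
    """
    payload = json.dumps({
        "model": "llama3",          # placeholder model name
        "prompt": prompt,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code == 503:          # hypothetical "hardware cannot take another request"
                time.sleep(2 ** attempt)  # back off and try again
                continue
            raise
    raise RuntimeError("server stayed overloaded after all retries")
```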

GiteaMirror added the feature request label 2026-04-22 06:28:35 -05:00

@jmorganca commented on GitHub (May 5, 2024):

Yes absolutely! You hit the nail on the head on how this feature will evolve 😊. `OLLAMA_NUM_PARALLEL` is an experimental feature flag that will eventually be replaced with this sort of improvement.


@0sengseng0 commented on GitHub (May 6, 2024):

> Yes absolutely! You hit the nail on the head on how this feature will evolve 😊. `OLLAMA_NUM_PARALLEL` is an experimental feature flag that will eventually be replaced with this sort of improvement.

But why are requests still not processed in parallel when I configure `OLLAMA_NUM_PARALLEL`? And the `num_ctx` I configured did not take effect either 😊

![image](https://github.com/ollama/ollama/assets/73268510/bc8df5eb-5863-450b-aaf0-ae6811cfd090)
![image](https://github.com/ollama/ollama/assets/73268510/ed411abe-dfc0-44f0-b8c9-9f9d2c115810)
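
For reference, a rough sketch of where the two settings live: `OLLAMA_NUM_PARALLEL` is read by the server from its environment at startup (e.g. `OLLAMA_NUM_PARALLEL=3 ollama serve`), while `num_ctx` is a per-request option in the `options` field of the API call (or a `PARAMETER num_ctx` line in a Modelfile). The model name and values below are placeholders.

```python
import json
import urllib.request

# OLLAMA_NUM_PARALLEL is set in the *server's* environment before it starts;
# num_ctx is requested here, per call, via the "options" field.
body = json.dumps({
    "model": "qwen:7b",              # placeholder model name
    "prompt": "Say hello in one sentence.",
    "stream": False,
    "options": {"num_ctx": 8192},    # ask for a larger context window than the 2048 default
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```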


@jmorganca commented on GitHub (May 6, 2024):

Hi @0sengseng0 it seems you're missing an O: `OLLAMA_NUM_PARALLEL=3`

Will close this as it is definitely plan A :)


@0sengseng0 commented on GitHub (May 7, 2024):

> Hi @0sengseng0 it seems you're missing an O: `OLLAMA_NUM_PARALLEL=3`
>
> Will close this as it is definitely plan A :)

According to the logs, n_ctx is still at the default 2048 and has not been changed. What happens with long prompts, then? Are they truncated? (I see GPU usage is always fixed, even with long prompts and multiple threads; token usage stays the same.)


@DirtyKnightForVi commented on GitHub (May 7, 2024):

> > Hi @0sengseng0 it seems you're missing an O: `OLLAMA_NUM_PARALLEL=3`
> >
> > Will close this as it is definitely plan A :)
>
> According to the logs, n_ctx is still at the default 2048 and has not been changed. What happens with long prompts, then? Are they truncated? (I see GPU usage is always fixed, even with long prompts and multiple threads; token usage stays the same.)

`OLLAMA_NUM_PARALLEL` splits `n_ctx` across the parallel slots.


@DirtyKnightForVi commented on GitHub (May 7, 2024):

In my case, my prompt is as follows:

# DDL

```sql
create table XXXXX
(about 12000 tokens)
```

# HINT

(about 500 tokens, notes, etc.)

If `OLLAMA_NUM_PARALLEL=2`, the model cannot get the table name. If `OLLAMA_NUM_PARALLEL=1`, the model's output recognizes table names accurately and flawlessly.


@0sengseng0 commented on GitHub (May 7, 2024):

> In my case, my prompt is as follows:
>
> # DDL
>
> ```sql
> create table XXXXX
> (about 12000 tokens)
> ```
>
> # HINT
>
> (about 500 tokens, notes, etc.)
>
> If `OLLAMA_NUM_PARALLEL=2`, the model cannot get the table name. If `OLLAMA_NUM_PARALLEL=1`, the model's output recognizes table names accurately and flawlessly.

Do you mean that n_ctx is shared among requests received in the same time window? For example, if I send 10 requests within 1 s and n_ctx is 20480, does each request actually get only 2048 tokens?


@DirtyKnightForVi commented on GitHub (May 7, 2024):

> Do you mean that n_ctx is shared among requests received in the same time window? For example, if I send 10 requests within 1 s and n_ctx is 20480, does each request actually get only 2048 tokens?

Yeah, sure.

So, I use Open WebUI with Ollama.

Each model instance is set up by parameters like `n_ctx`, while `OLLAMA_NUM_PARALLEL` is a shared parameter for all instances. If we take any two instances with `n_ctx=A` and `n_ctx=B`, then the actual context for each instance is calculated as `n_ctx / OLLAMA_NUM_PARALLEL`.
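
A worked version of that split. Whether a given Ollama release divides the configured context across slots or allocates it per slot has varied, so take this as an illustration of the behavior described in this thread rather than a guarantee:

```python
def per_slot_context(num_ctx: int, num_parallel: int) -> int:
    """Effective context each concurrent request gets, per the split described above."""
    return num_ctx // num_parallel

# Numbers reported earlier in this thread:
print(per_slot_context(2048, 2))    # 1024 -> far too small for a ~12,500-token DDL prompt
print(per_slot_context(20480, 10))  # 2048 -> ten simultaneous requests get ~2048 tokens each
```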


@0sengseng0 commented on GitHub (May 7, 2024):

> Each model instance is set up by parameters like `n_ctx`, while `OLLAMA_NUM_PARALLEL` is a shared parameter for all instances. If we take any two instances with `n_ctx=A` and `n_ctx=B`, then the actual context for each instance is calculated as `n_ctx / OLLAMA_NUM_PARALLEL`.

Thanks, you gave me a sense of why these two problems occurred:

  1. When I set OLLAMA_NUM_PARALLEL=100, the response is only one sentence.
  2. The GPU occupancy is constant all the time.

Now, is there anything Ollama can do to improve GPU usage? I changed these two parameters, but Ollama still doesn't use more resources.
![image](https://github.com/ollama/ollama/assets/73268510/148d7f87-9a55-450b-ad15-9b9e85f02291)
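
One thing worth ruling out here: GPU utilization only rises if requests actually overlap in time. A quick way to check, assuming the same placeholder model as in the earlier sketch, is to fire several requests concurrently and see whether overall throughput improves:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    body = json.dumps({
        "model": "qwen:7b",   # placeholder model name
        "prompt": prompt,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Send four prompts at once; with OLLAMA_NUM_PARALLEL >= 4 these should be
# decoded together in one batch rather than strictly one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(generate, [f"Question {i}: why is the sky blue?" for i in range(4)]))
print(len(answers), "responses received")
```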


@DirtyKnightForVi commented on GitHub (May 7, 2024):

> Thanks, you gave me a sense of why these two problems occurred:
>
>   1. When I set OLLAMA_NUM_PARALLEL=100, the response is only one sentence.
>   2. The GPU occupancy is constant all the time.
>
> Now, is there anything Ollama can do to improve GPU usage? I changed these two parameters, but Ollama still doesn't use more resources.

I'm glad I could help you out. 😊
From what I've practiced and observed:

  1. It seems that Ollama dynamically regulates resource allocation.
  2. Once multiple instances are loaded, the resources they occupy remain constant for the time specified by the `keep_alive` parameter.
  3. Different instances of the same model may have varying resource usage. For example, Model A with `num_ctx` set to 1024 versus 2048 will have different memory consumption (a rough estimate follows below).
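
On point 3, the `num_ctx`-dependent part of memory use is mostly the KV cache, which grows linearly with the context length. A back-of-the-envelope estimate, assuming 7B-class model dimensions (32 layers, 32 KV heads, head dim 128, fp16 cache); real models and quantized caches will differ:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_elem: int = 2) -> int:
    """Approximate K/V cache size: 2 tensors (K and V) per layer,
    one head_dim vector per token per KV head, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem

# Assumed 7B-class dimensions (32 layers, 32 KV heads, head_dim 128):
for ctx in (1024, 2048):
    gib = kv_cache_bytes(32, 32, 128, ctx) / 2**30
    print(f"num_ctx={ctx}: ~{gib:.2f} GiB of KV cache")
# num_ctx=1024 -> ~0.50 GiB, num_ctx=2048 -> ~1.00 GiB: doubling num_ctx
# doubles this slice of the model's memory footprint.
```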

@JedrzejMajko commented on GitHub (Jul 3, 2025):

Ollama cannot run one model many times over. OLLAMA_NUM_PARALLEL does not work the way you think:
OLLAMA_NUM_PARALLEL just allocates more memory instead of reusing the model already loaded into memory.


@mirage335 commented on GitHub (Jul 18, 2025):

I think the complexities of GPU shared memory, sparse layers, and detachable eGPUs will make it VERY difficult to avoid many users simply forcing OLLAMA_NUM_PARALLEL=1, until consumer laptops and GPUs start offering serious amounts of VRAM (i.e. at least 24 GB, maybe 32 GB, instead of 16 GB).


@colinemondswieprecht commented on GitHub (Dec 22, 2025):

#11977 was just closed as a duplicate of this issue. @dhiltgen @jmorganca Is this issue here (#4170) even still being worked on? Nothing appears to have moved since this was closed as soon-to-be-completed in May 2024.

Reference: github-starred/ollama#28352