[GH-ISSUE #7511] About OLLAMA_SCHED_SPREAD env: how to load a model on two GPUs #30538

Closed
opened 2026-04-22 10:14:29 -05:00 by GiteaMirror · 13 comments
Owner

Originally created by @Kouuh on GitHub (Nov 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7511

Originally assigned to: @dhiltgen on GitHub.

### What is the issue?

Previously, I was using version 515 of the NVIDIA driver with CUDA 11.7, and I could load a model (qwen1.5_32b_q8) on two GPUs by setting OLLAMA_SCHED_SPREAD=1.
But now I am using version 535 of the NVIDIA driver with CUDA 12.1, and I want to load a model (qwen1.5_32b_q8) on two GPUs. Even if I set OLLAMA_SCHED_SPREAD=1, and tried ollama v0.2.5 and ollama v0.3.14, ollama will still run a complete model on two graphics cards. How can I solve this problem?

I run the service via the command `CUDA_VISIBLE_DEVICES=6,7 /usr/bin/ollama serve`.
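
For reference, a minimal sketch of the intended invocation, combining that device mask with the spread setting this issue is about (paths and GPU indices are taken from the report above):

```sh
# Restrict ollama to GPUs 6 and 7, and ask the scheduler to spread a
# single model's layers across all visible GPUs instead of packing it
# onto one card.
CUDA_VISIBLE_DEVICES=6,7 OLLAMA_SCHED_SPREAD=1 /usr/bin/ollama serve
```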

### OS

Linux

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.2.5 and 0.3.14

GiteaMirror added the nvidia, bug, and needs more info labels 2026-04-22 10:14:30 -05:00

@rick-github commented on GitHub (Nov 5, 2024):

> ollama will still run a complete model on two graphics cards

Is this a typo or do I misunderstand the question? Isn't this what you want?

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@dhiltgen commented on GitHub (Nov 5, 2024):

In addition to the server log, I'd suggest using UUIDs instead of numeric IDs for the visible devices variable to ensure there isn't an ordering problem. You mention 6 and 7 implying you have quite a few GPUs, so it's possible you have older GPUs in the mix which may be causing us to fall back to cuda v11 for compatibility reasons, but it should still be able to load the model across multiple GPUs. In either case, the server log will help us understand what's going on.
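
A sketch of the UUID approach (the `GPU-...` values below are placeholders, not real UUIDs; substitute the ones reported for your own cards):

```sh
# List each GPU's index, UUID, and name.
nvidia-smi --query-gpu=index,uuid,name --format=csv

# Pin ollama to specific cards by UUID so a change in numeric ordering
# between driver versions can't silently select different GPUs.
CUDA_VISIBLE_DEVICES=GPU-aaaaaaaa-1111-2222-3333-444444444444,GPU-bbbbbbbb-5555-6666-7777-888888888888 /usr/bin/ollama serve
```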


@Kouuh commented on GitHub (Nov 6, 2024):

> > ollama will still run a complete model on two graphics cards
>
> Is this a typo or do I misunderstand the question? Isn't this what you want?
>
> [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

Maybe I expressed it wrongly. What I want to achieve is to split a model into two pieces and run them on two GPUs, but right now a full copy of the model runs on each card.
I am using the qwen1.5_32b model with q8 quantization. The local model file is 33G.


@Kouuh commented on GitHub (Nov 6, 2024):

![IMG20241106094405](https://github.com/user-attachments/assets/b5f5b646-9cbf-48d1-a333-301bb02f0167)


@Kouuh commented on GitHub (Nov 6, 2024):

![IMG20241106095709](https://github.com/user-attachments/assets/a9dc7150-a21d-44f8-be08-85e2717a0403)
The above two pictures are the debug information I printed. Sorry, the information and images displayed may be incomplete.


@rick-github commented on GitHub (Nov 6, 2024):

Please don't use photographs of your screen, add the logs as text, either pasted in a comment or as an attachment.

It's still not clear what you want to achieve. With `OLLAMA_SCHED_SPREAD=1`, ollama will split the model equally across all available cards. It sounds like this is what you want. It does not run an independent copy of the model on each card. What makes you think this is not working correctly?


@Kouuh commented on GitHub (Nov 6, 2024):

> Please don't use photographs of your screen, add the logs as text, either pasted in a comment or as an attachment.
>
> It's still not clear what you want to achieve. With `OLLAMA_SCHED_SPREAD=1`, ollama will split the model equally across all available cards. It sounds like this is what you want. It does not run an independent copy of the model on each card. What makes you think this is not working correctly?

I used nvidia-smi to check the memory usage.
I am using two A100 (40G) graphics cards, and the weights of the model I want to use are 34G.
When I load the 34G model across the two graphics cards, each card should only use 17G of memory. But now I find that each card is actually using 34G of memory.

Now its inference speed has become very slow, only 2 tokens per second.


@rick-github commented on GitHub (Nov 6, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) should help in figuring out what is happening. Also include the output of `nvidia-smi`.
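
For a per-card view, something like the following works (a sketch; plain `nvidia-smi` output is fine too):

```sh
# Show per-GPU memory usage, which makes it clear whether the model is
# split (~17G per card) or fully resident on each card (~34G per card).
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```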


@rick-github commented on GitHub (Nov 6, 2024):

You have a large context window of 16000 tokens and `OLLAMA_NUM_PARALLEL=8`. This requires 31.2GB. Combined with the 34G of model weights and the other memory required for supporting data structures, you need 90GB to host the model, so it fills up both cards and then partially spills into system RAM, which is why your token speed is slow. Reduce the context window or reduce `OLLAMA_NUM_PARALLEL`.
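
A sketch of those two knobs, assuming the goal is to fit within 2x40GB (the values are illustrative, not tuned):

```sh
# Fewer parallel slots means less KV cache to allocate; the weights
# themselves still spread across both GPUs via OLLAMA_SCHED_SPREAD.
CUDA_VISIBLE_DEVICES=6,7 OLLAMA_SCHED_SPREAD=1 OLLAMA_NUM_PARALLEL=2 /usr/bin/ollama serve
```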


@Kouuh commented on GitHub (Nov 6, 2024):

> You have a large context window of 16000 tokens and `OLLAMA_NUM_PARALLEL=8`. This requires 31.2GB. Combined with the 34G of model weights and the other memory required for supporting data structures, you need 90GB to host the model, so it fills up both cards and then partially spills into system RAM, which is why your token speed is slow. Reduce the context window or reduce `OLLAMA_NUM_PARALLEL`.

Thank you, I have solved the slow inference problem by lowering `num_ctx`. I would like to ask why setting `num_ctx` too high causes the inference speed to slow down.


@rick-github commented on GitHub (Nov 6, 2024):

`num_ctx` is the space used for processing the tokens that are fed into the model, like short-term memory. The larger the context space, the more VRAM needs to be allocated on the GPU to hold it. If there's not enough room in VRAM to hold both the context and the model weights, the model weights get moved to system RAM and are processed by the CPU. The CPU is slower than the GPU, so token generation speed decreases.
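
One way to see the spillover directly (hedged; the exact columns vary by ollama version):

```sh
# `ollama ps` lists loaded models with a PROCESSOR column: "100% GPU"
# means fully offloaded, while something like "24%/76% CPU/GPU" means
# part of the model was pushed into system RAM.
ollama ps
```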


@YUHSINCHENG1230 commented on GitHub (Nov 14, 2024):

Where do I set `num_ctx`?

@rick-github commented on GitHub (Nov 14, 2024):

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size
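
From that FAQ, two ways to set it (a sketch; the model name below is just an example):

```sh
# Interactively, inside `ollama run <model>`:
#   /set parameter num_ctx 4096

# Or per request via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen:32b-chat-v1.5-q8_0",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 4096 }
}'
```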

Reference: github-starred/ollama#30538