[GH-ISSUE #5398] OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS not having an effect on Ubuntu 22.04 LTS #3378

Closed
opened 2026-04-12 14:00:37 -05:00 by GiteaMirror · 1 comment

Originally created by @mrmiket64 on GitHub (Jul 1, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5398

What is the issue?

Short description: I have set "OLLAMA_NUM_PARALLEL=4" and "OLLAMA_MAX_LOADED_MODELS=2", but I cannot load two models at the same time on Ollama 0.1.48.

Note 1: The variables took effect and worked as expected in an older Ollama version, I think v0.1.34.
Note 2: I asked a friend to run the same test, and it worked fine for him on a Mac M1 with Ollama 0.1.48.

Operating System: Ubuntu 22.04.4 LTS
Hardware:

  • Processor: i7 7700
  • RAM: 64 GB
  • GPU1: Nvidia 1070
  • GPU2: Nvidia 1070

Testing: I tried to load the models "mistral:7b-instruct-q8_0" and "llama3:8b-instruct-q8_0" at the same time by calling each with "ollama run" from two remote SSH sessions, but only one model was loaded at a time. I confirmed this with "ollama ps": inference ran first with one model and then with the other, sequentially.
Screenshot 2024-07-01 at 12 08 22 a m: https://github.com/ollama/ollama/assets/99057519/f1664ea1-2897-4c80-8091-8caac0bf5e06
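
The test described above can also be driven over Ollama's HTTP API instead of two SSH sessions. The following is a minimal sketch, assuming the server listens on the default address `http://localhost:11434` and that both models are already pulled; the helper names and the prompt are illustrative and not part of the original report.

```python
import json
import threading
import urllib.request

# Assumption: default Ollama address; adjust if the server is bound elsewhere.
BASE_URL = "http://localhost:11434"
MODELS = ["mistral:7b-instruct-q8_0", "llama3:8b-instruct-q8_0"]


def build_generate_request(model, prompt):
    """Build a non-streaming POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{BASE_URL}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, results):
    """Send one prompt to one model and store the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        results[model] = json.load(resp)["response"]


def loaded_models():
    """Return the names of the models currently loaded (what `ollama ps` lists)."""
    with urllib.request.urlopen(f"{BASE_URL}/api/ps") as resp:
        return [entry["name"] for entry in json.load(resp)["models"]]


def reproduce(prompt="Why is the sky blue?"):
    """Query both models concurrently, then report which ones remain loaded."""
    results = {}
    threads = [
        threading.Thread(target=generate, args=(model, prompt, results))
        for model in MODELS
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, loaded_models()
```

Calling `reproduce()` against a running server returns the two responses and the list of loaded models. If only one model name comes back, the models were served one after the other, which matches the behavior reported here.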

Here is a screenshot of the variable setup, as shown by the command "sudo systemctl edit ollama":
Screenshot 2024-06-30 at 11 51 38 p m: https://github.com/ollama/ollama/assets/99057519/c9f27da4-b3ad-4189-875c-5f71cd3f12a9

Attached is the status of the service; as you can see, the variables are picked up by the service.
ollama_status_2.txt: https://github.com/user-attachments/files/16048447/ollama_status_2.txt
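
One way to double-check that the service really sees both variables is to parse the output of `systemctl show ollama --property=Environment`. The sketch below assumes that command prints a single `Environment=...` line with space-separated `KEY=VALUE` pairs; the sample string is a hypothetical example of that output, not taken from the attached file.

```python
import shlex


def parse_environment(show_output: str) -> dict[str, str]:
    """Parse the `Environment=` line printed by `systemctl show` into a dict."""
    line = show_output.strip()
    prefix = "Environment="
    if not line.startswith(prefix):
        return {}
    env = {}
    # shlex handles values that systemd quotes because they contain spaces.
    for item in shlex.split(line[len(prefix):]):
        key, sep, value = item.partition("=")
        if sep:
            env[key] = value
    return env


# Hypothetical sample of the command output for the setup in this report.
SAMPLE = "Environment=OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2\n"
env = parse_environment(SAMPLE)
print(env.get("OLLAMA_NUM_PARALLEL"), env.get("OLLAMA_MAX_LOADED_MODELS"))
```

On a live system the string could be obtained with `subprocess.run(["systemctl", "show", "ollama", "--property=Environment"], capture_output=True, text=True).stdout`.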

Also attached are the logs captured during the test.
troubleshooting_logs.txt: https://github.com/user-attachments/files/16048587/troubleshooting_logs.txt

Please let me know if more detail is needed.

Thank you
Mike

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.48

GiteaMirror added the bug label 2026-04-12 14:00:37 -05:00

@dhiltgen commented on GitHub (Jul 2, 2024):

You do not have sufficient VRAM to fully load both models. The current implementation favors performance over partially loading secondary models. We've updated the FAQ (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests) to explain how concurrency works in more detail.

Jul 01 05:16:33 neuron0 ollama[20001]: time=2024-07-01T05:16:33.582Z level=INFO source=types.go:98 msg="inference compute" id=GPU-11bd15d7-3ae6-d4f2-e9dd-2a5dd4b0c618 library=cuda compute=6.1 driver=12.2 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="7.8 GiB"
Jul 01 05:16:33 neuron0 ollama[20001]: time=2024-07-01T05:16:33.582Z level=INFO source=types.go:98 msg="inference compute" id=GPU-b9d0fd05-27ae-d4f0-3db0-4c9ef3f6b354 library=cuda compute=6.1 driver=12.2 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="7.8 GiB"

Total VRAM: ~16G
The models you're trying to load require ~22G
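
The arithmetic behind this answer can be reproduced from the two log lines above. This is only a rough sketch: it sums the `available` values reported per GPU and compares them with the ~22G estimate from this comment. It is not Ollama's actual scheduling logic, which is not shown in this thread, and the function names are made up for illustration.

```python
import re

# The two "inference compute" log lines quoted above, shortened to the relevant fields.
LOG_LINES = [
    'msg="inference compute" id=GPU-11bd15d7-3ae6-d4f2-e9dd-2a5dd4b0c618 '
    'name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="7.8 GiB"',
    'msg="inference compute" id=GPU-b9d0fd05-27ae-d4f0-3db0-4c9ef3f6b354 '
    'name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="7.8 GiB"',
]

# Estimate quoted in the comment for both models together.
REQUIRED_GIB = 22.0

AVAILABLE_RE = re.compile(r'available="([\d.]+) GiB"')


def available_gib(lines):
    """Sum the available VRAM (in GiB) reported across all GPU log lines."""
    return sum(
        float(match.group(1))
        for line in lines
        for match in AVAILABLE_RE.finditer(line)
    )


def fits_fully(lines, required_gib):
    """Return True if the combined available VRAM covers the requirement."""
    return required_gib <= available_gib(lines)


total = available_gib(LOG_LINES)
print(f"available: {total:.1f} GiB, required: {REQUIRED_GIB:.1f} GiB")
print("both models fit" if fits_fully(LOG_LINES, REQUIRED_GIB) else "both models do not fit")
```

With roughly 15.6 GiB available against about 22 GiB required, the check fails, which is consistent with only one model being loaded at a time.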

Reference: github-starred/ollama#3378