[GH-ISSUE #8757] loading a new model pauses running inference #31443

Closed
opened 2026-04-22 11:53:31 -05:00 by GiteaMirror · 4 comments

Originally created by @jozsefszalma on GitHub (Feb 1, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8757

What is the issue?

I have two models:

  • model A is loaded across 4 GPUs, happily inferencing
  • model B is set to num_gpu 0 and otherwise works just fine on CPU cores.

Starting inference on model B (which makes Ollama begin loading it into RAM) pauses model A's inference. There are no errors; GPU compute utilization simply drops to zero (the model remains loaded in the GPUs' VRAM).
Once model B has finished loading into RAM, both models run inference in parallel.
This is especially painful with large models (hundreds of GB) that take minutes to load from an NVMe drive.

I'm running the requests through the API (with Open WebUI).

Environment variables:
OLLAMA_MAX_LOADED_MODELS=2
OLLAMA_KEEP_ALIVE=30m
OLLAMA_NUM_PARALLEL=2
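
For context, on a systemd install (as with the install.sh script) variables like these are typically set via a service override; a minimal sketch, not taken from this thread:

sudo systemctl edit ollama.service

# in the editor that opens, add:
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=2"

sudo systemctl daemon-reload
sudo systemctl restart ollama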

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-22 11:53:31 -05:00

@rick-github commented on GitHub (Feb 1, 2025):

I was unable to replicate. I opened a terminal and ran:

curl localhost:11434/api/generate -d '{"model":"deepseek-r1:7b","prompt":"why is the sky blue?"}'

Then opened a second terminal and, once the first had started producing tokens, ran:

curl localhost:11434/api/generate -d '{"model":"qwen2.5:7b","prompt":"why is the sky blue?","options":{"num_gpu":0},"keep_alive":0}'

There was no pause in either token stream.
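
As a sketch, one way to make a pause visible is to timestamp each streamed chunk; this assumes ts from moreutils is installed:

curl -sN localhost:11434/api/generate -d '{"model":"deepseek-r1:7b","prompt":"why is the sky blue?"}' | ts '%H:%M:%.S'

Any multi-second gap between consecutive chunks would then show up directly in the timestamps.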

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md) may provide insight. Can you provide some info on your system: version of Linux, ollama installation method, CPU (real/virtual), etc.?
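
For a systemd install, a minimal sketch for pulling the server log around the time of a stall:

journalctl -u ollama --no-pager --since "10 minutes ago"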


@jozsefszalma commented on GitHub (Feb 1, 2025):

Hey Rick. Interestingly enough, I also couldn't reproduce the issue after I rebooted the system.
It's Ubuntu 22.04.4 LTS; ollama is installed with the https://ollama.com/install.sh script. It's a physical box in my basement with an EPYC CPU, 512 GB RAM, and some Nvidia GPUs.
The smaller model on the GPUs was DeepSeek R1 Distill Llama 70B and the larger model being loaded into main memory was DeepSeek R1 671b.

I'm seeing this in the log from the time I observed the issue:

Feb 01 11:29:30 xxxx ollama[88576]: time=2025-02-01T11:29:30.979+01:00 level=INFO source=sched.go:449 msg="loaded runners" count=2
Feb 01 11:29:30 xxxx ollama[88576]: time=2025-02-01T11:29:30.979+01:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
Feb 01 11:29:30 xxxx ollama[88576]: time=2025-02-01T11:29:30.980+01:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
Feb 01 11:29:30 xxxx ollama[88576]: time=2025-02-01T11:29:30.984+01:00 level=INFO source=runner.go:936 msg="starting go runner"
Feb 01 11:29:31 xxxx ollama[88576]: time=2025-02-01T11:29:31.006+01:00 level=INFO source=runner.go:937 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=12
Feb 01 11:29:31 xxxx ollama[88576]: time=2025-02-01T11:29:31.006+01:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:34885"
Feb 01 11:29:31 xxxx ollama[88576]: time=2025-02-01T11:29:31.231+01:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"


@rick-github commented on GitHub (Feb 1, 2025):

The two runners should be pretty independent; I can't think of a reason why a model hosted wholly in VRAM would suffer any effects from a model being loaded into system memory - there's no page or memory thrashing, no PCIe lane contention, etc. Could the GPUs have been throttled internally by a thermal slowdown or power cap?
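
A sketch of how to rule that out on Nvidia while a stall is happening:

# report throttle reasons and power limits
nvidia-smi -q -d PERFORMANCE,POWER
# or watch power draw, temperature, and utilization once per second
nvidia-smi dmon -s pu -d 1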


@jozsefszalma commented on GitHub (Feb 1, 2025):

I was monitoring the GPUs with nvtop when this occurred; nothing unexpected was happening, and thermals were well below the limit. I can't recall whether there was any significant swapping, but I'm pretty sure I still had 100 GB of main memory free. Not sure how to check the rest retroactively.

Anyway, since this is heading in the direction of my HW setup, perhaps we could put this aside and I'll report back when I see the issue again.
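
As a sketch, memory and swap pressure can sometimes be checked after the fact with sysstat, assuming the sar collector was running at the time:

sar -r -f /var/log/sysstat/sa01    # memory usage history for the 1st of the month
sar -S -f /var/log/sysstat/sa01    # swap usage history

/var/log/sysstat is the Ubuntu default path; the file name follows the day of the month.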

Reference: github-starred/ollama#31443