[GH-ISSUE #9625] one model loaded multiple times hogging whole available memory #52793

Open
opened 2026-04-29 00:54:09 -05:00 by GiteaMirror · 13 comments

Originally created by @tendermonster on GitHub (Mar 10, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9625

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Currently I'm using tabby and openweb-ui with a coding model. I have noticed that after some time the GPUs run out of memory, even though only one model is loaded and it should occupy only about 10 GB.

When this point is reached, requests to ollama never complete and freeze indefinitely. At that point the only option is to restart the service. It seems that ollama does not clean up loaded models from memory.

How can I debug this? Is there any setting to force only one model to be loaded?

Help is appreciated.

![Image](https://github.com/user-attachments/assets/896b468d-4e9d-4853-97c3-d9911f8963e0)

Relevant log output

ollama version is 0.5.7. The following errors occur in the log file (the lines here are cherry-picked as potentially relevant):

```
level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.224019477 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
ollama[2186]: time=2025-03-10T16:26:44.218+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.474039067 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
ollama[2186]: time=2025-03-10T16:26:44.467+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.723185116 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed

level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
level=ERROR source=sched.go:325 msg="finished request signal received after model unloaded" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
```

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the gpu, bug, nvidia labels 2026-04-29 00:54:10 -05:00

@rick-github commented on GitHub (Mar 10, 2025):

> Is there any setting to force only one model to be loaded?

Set [`OLLAMA_MAX_LOADED_MODELS=1`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests:~:text=OLLAMA_MAX_LOADED_MODELS) in the server environment.

However, ollama should automatically unload models when space is tight and a model is being loaded. Your screenshot shows four runners, and if you are loading only one model, it could be that the server is crashing and orphaning the runners. The server log should show what's happening.
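
On a systemd-managed Linux install, a common way to set that variable is a drop-in override for the ollama unit. The snippet below is only a sketch of that approach; the drop-in path is the usual `systemctl edit` location and the second variable is an optional extra, not something suggested in this thread:

```
# Hypothetical drop-in created with `sudo systemctl edit ollama`,
# typically stored as /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Optional: also cap the number of parallel request slots per model
# Environment="OLLAMA_NUM_PARALLEL=1"
```

After saving, reload systemd and restart ollama (`sudo systemctl daemon-reload && sudo systemctl restart ollama`) so the variable takes effect.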


@tendermonster commented on GitHub (Mar 10, 2025):

> > Is there any setting to force only one model to be loaded?
>
> Set `OLLAMA_MAX_LOADED_MODELS=1` in the server environment.
>
> However, ollama should automatically unload models when space is tight and a model is being loaded. Your screenshot shows four runners, and if you are loading only one model, it could be that the server is crashing and orphaning the runners. The server log should show what's happening.

I assume that setting the OLLAMA_MAX_LOADED_MODELS variable would not really solve the problem; it would only limit loading N models per GPU. With that, should the model not be properly offloaded from the GPU, ollama might just load other models on the CPU. I will go through the logs and let you know if I find the culprit. Otherwise I might publish the full log output if I'm unable to pinpoint the cause.


@tendermonster commented on GitHub (Mar 10, 2025):

Could a closed connection be the main reason for such behavior? If so, is it somehow preventable, so that if the connection is closed prematurely for any reason the model still offloads properly?

This is basically the recurring theme:

```
Feb 28 01:25:01  ollama[1794]: time=2025-02-28T01:25:01.793+01:00 level=WARN source=server.go:562 **msg="client connection closed before server finished loading, aborting load"**
Feb 28 01:25:01  ollama[1794]: time=2025-02-28T01:25:01.793+01:00 level=ERROR source=sched.go:455 **msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"**
Feb 28 01:25:01  ollama[1794]: [GIN] 2025/02/28 - 01:25:01 | 499 |  6.999396951s |       127.0.0.1 | POST     "/api/generate"
Feb 28 01:25:06  ollama[1794]: time=2025-02-28T01:25:06.440+01:00 level=WARN source=types.go:512 msg="invalid option provided" option=tfs_z
Feb 28 01:25:06  ollama[1794]: time=2025-02-28T01:25:06.440+01:00 level=WARN source=types.go:512 msg="invalid option provided" option=num_gqa
Feb 28 01:25:06  ollama[1794]: time=2025-02-28T01:25:06.808+01:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed gpu=GPU-2f1ff6aa-d9f5-778a-0e93-fafb4a0333f9 parallel=4 available=20953956352 required="10.8 GiB"
Feb 28 01:25:06  ollama[1794]: time=2025-02-28T01:25:06.952+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.158715694 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.171+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.377998162 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.278+01:00 level=WARN source=types.go:512 msg="invalid option provided" option=num_gqa
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.278+01:00 level=WARN source=types.go:512 msg="invalid option provided" option=tfs_z
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.391+01:00 level=INFO source=server.go:104 msg="system memory" total="251.4 GiB" free="243.8 GiB" free_swap="980.0 MiB"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.392+01:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[19.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.8 GiB" memory.required.partial="10.8 GiB" memory.required.kv="1.5 GiB" memory.required.allocations="[10.8 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.3 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.393+01:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --port 45527"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.393+01:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.393+01:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.394+01:00 level=WARN source=server.go:562 msg="client connection closed before server finished loading, aborting load"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.394+01:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
Feb 28 01:25:07  ollama[1794]: [GIN] 2025/02/28 - 01:25:07 | 499 |  993.695663ms |       127.0.0.1 | POST     "/api/generate"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.466+01:00 level=INFO source=runner.go:936 msg="starting go runner"
Feb 28 01:25:07  ollama[1794]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Feb 28 01:25:07  ollama[1794]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Feb 28 01:25:07  ollama[1794]: ggml_cuda_init: found 1 CUDA devices:
Feb 28 01:25:07  ollama[1794]:   Device 0: NVIDIA RTX A4500, compute capability 8.6, VMM: yes
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.486+01:00 level=INFO source=runner.go:937 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=18
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.487+01:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:45527"
Feb 28 01:25:07  ollama[1794]: llama_load_model_from_file: using device CUDA0 (NVIDIA RTX A4500) - 19983 MiB free
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.653+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.860210936 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
Feb 28 01:25:07  ollama[1794]: llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed (version GGUF V3 (latest))
Feb 28 01:25:07  ollama[1794]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
```

As it seems that connection errors are the problem, and I'm running ollama behind nginx, could it be that I'm missing some configuration?

This is the current nginx config for Tabby:

```
location / {
    proxy_pass http://localhost:8080;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}
```

Could it be that websocket support is missing?

```
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
```
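
For reference, a minimal sketch of what a websocket-capable location block in front of ollama could look like, with longer upstream timeouts so nginx does not drop the connection while a model is still loading. The port (ollama's default 11434), paths and timeout values here are assumptions for illustration, not taken from this setup:

```
location / {
    # assumed ollama backend on its default port
    proxy_pass http://localhost:11434;

    # websocket / streaming support
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;

    # give slow model loads time to finish instead of closing the upstream connection
    proxy_connect_timeout 60s;
    proxy_send_timeout 600s;
    proxy_read_timeout 600s;

    # stream generated tokens instead of buffering the whole response
    proxy_buffering off;
}
```

Whether the disconnects actually come from nginx would still need to be confirmed in the nginx access/error logs.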

@rick-github commented on GitHub (Mar 10, 2025):

> Could a closed connection be the main reason for such behavior?

Unlikely. The log snippet shows the client disconnecting before the model is ready; ollama will just discard the model load, so it wouldn't lead to multiple runners. The nginx logs may show why the ollama client is terminating early, either because of an nginx timeout or an nginx client timeout.


@tendermonster commented on GitHub (Mar 10, 2025):

Hmm, the runners do seem to break somehow. If you have any ideas of what I could debug next, let me know; as of now I'm out of ideas.


@rick-github commented on GitHub (Mar 10, 2025):

You could just add the server log.


@tendermonster commented on GitHub (Mar 10, 2025):

OK, if you see anything let me know.

[ollama.log](https://github.com/user-attachments/files/19169441/ollama.log)


@rick-github commented on GitHub (Mar 10, 2025):

When it gets stuck, what's the output of `ollama ps`?


@tendermonster commented on GitHub (Mar 10, 2025):

See the screenshot above. It seems that by the time I took the screenshot the model was loaded on the CPU. The main suspect is the coder model. This is also indicated in the logs, as it mostly comes up when the connection is closed faster than ollama can load the model.


@rick-github commented on GitHub (Mar 10, 2025):

Well, it's not the server that's creating orphan processes. Over the life of a single server, the model goes from fitting on one GPU, to being split across two GPUs, then having fewer and fewer layers offloaded, num_parallel reduced, and then eventually nothing fits on the GPU and the model is loaded into CPU. And all in the space of 44 seconds.

```
Mär 10 16:26:53 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --port 36189"
Mär 10 16:26:58 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --port 43987"
Mär 10 16:27:11 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --tensor-split 25,24 --port 33571"
Mär 10 16:27:15 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --tensor-split 25,24 --port 35803"
Mär 10 16:27:19 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --tensor-split 25,24 --port 43191"
Mär 10 16:27:23 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --n-gpu-layers 18 --threads 18 --parallel 1 --tensor-split 12,6 --port 40737"
Mär 10 16:27:27 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --n-gpu-layers 18 --threads 18 --parallel 1 --tensor-split 12,6 --port 45649"
Mär 10 16:27:31 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --n-gpu-layers 18 --threads 18 --parallel 1 --tensor-split 12,6 --port 42417"
Mär 10 16:27:35 ollama[2186]: cmd="ollama/runners/cpu_avx2/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --threads 18 --no-mmap --parallel 1 --port 36049"
Mär 10 16:27:37 ollama[2186]: cmd="ollama/runners/cpu_avx2/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --threads 18 --no-mmap --parallel 1 --port 38303"
```

There may be some race condition in the runner handler that results in the runner staying active even when the handler thinks it's been terminated due to the client disconnecting, but I've been unable to replicate so far. Apart from `tfs_z` and `num_gqa`, does the client set any other options?


@tendermonster commented on GitHub (Mar 10, 2025):

These would be the options for tabby and openweb-ui. I'm not setting any custom options when loading the model.
I'm not surprised that some options may be incompatible. Should the same thing happen again, I'll send the debug log with OLLAMA_DEBUG=1 for more info.

Using:

- tabby 0.25.0
- openweb-ui 0.5.20
- [tabby uses ollama-rs with](https://github.com/TabbyML/tabby/blob/bb91366019565b4de0f4c112867ab752d539e2af/crates/ollama-api-bindings/src/completion.rs#L22)
- [full list of ollama-rs options](https://github.com/pepperoni21/ollama-rs/blob/19e30178f40dabd922aad1f80955464f4bb2a67a/ollama-rs/src/generation/options.rs#L5)
- [openweb-ui options](https://github.com/open-webui/open-webui/blob/d7bfa395b0672a21a41fb6706a4275673d339762/backend/open_webui/utils/payload.py#L83)


@tendermonster commented on GitHub (Mar 12, 2025):

So this time I got a log with debug output from when the model is not unloaded properly.

The trouble starts from Mär 11 13:00:16.

Is this output sufficient to identify the main cause?

[ollama_debug_cut.log](https://github.com/user-attachments/files/19212996/ollama_debug_cut.log)


@tendermonster commented on GitHub (Mar 17, 2025):

As a temporary workaround, I'm using a cronjob to restart ollama when the problem occurs:

```python
import subprocess


def is_ollama_in_gpu_memory():
    """Return True if an ollama runner process still shows up in nvidia-smi."""
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        return "ollama_llama_server" in result.stdout
    except Exception as e:
        print(f"Error checking GPU memory: {e}")
        return False


def is_ollama_ps_empty():
    """Return True if `ollama ps` prints only its header line, i.e. no models are loaded."""
    try:
        result = subprocess.run(["ollama", "ps"], capture_output=True, text=True)
        return "NAME    ID    SIZE    PROCESSOR    UNTIL \n" == result.stdout
    except Exception as e:
        print(f"Error checking ollama models: {e}")
        return True


def restart_ollama_service():
    """Restart the systemd service to free the orphaned GPU memory."""
    try:
        subprocess.run(["sudo", "systemctl", "restart", "ollama"], check=True)
    except Exception as e:
        print(f"Failed to restart ollama service: {e}")


if __name__ == "__main__":
    if is_ollama_in_gpu_memory() and is_ollama_ps_empty():
        print(
            "Detected `ollama_llama_server` in GPU memory but no models loaded. Restarting service..."
        )
        restart_ollama_service()
    else:
        print("Ollama status is normal. No action needed.")
```
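
One way to schedule it (the interval, script path and log path below are assumptions, not part of the original comment) is a root crontab entry along these lines:

```
# Hypothetical crontab entry (installed with `sudo crontab -e`), assuming the
# script above is saved as /usr/local/bin/ollama_watchdog.py -- check every 5 minutes
*/5 * * * * /usr/bin/python3 /usr/local/bin/ollama_watchdog.py >> /var/log/ollama_watchdog.log 2>&1
```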
Reference: github-starred/ollama#52793