[GH-ISSUE #9625] one model loaded multiple times hogging whole available memory #52793

Open
opened 2026-04-29 00:54:09 -05:00 by GiteaMirror · 13 comments

Originally created by @tendermonster on GitHub (Mar 10, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9625

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Currently I'm using tabby and openweb-ui with a coding model. I have noticed that after some time the GPUs run out of memory, even though only one model is loaded and it should occupy only about 10 GB.

When this point is reached, requests to ollama never complete and freeze indefinitely. At that point the only option is to restart the service. It seems that ollama does not clean up loaded models from memory.

How can I debug this? Is there any setting to force only one model to be loaded?

Help is appreciated.

![Image](https://github.com/user-attachments/assets/896b468d-4e9d-4853-97c3-d9911f8963e0)

Relevant log output

ollama version is 0.5.7. The following errors occur in the log file (the lines here are cherry-picked as potentially relevant):

```
level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.224019477 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
ollama[2186]: time=2025-03-10T16:26:44.218+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.474039067 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
ollama[2186]: time=2025-03-10T16:26:44.467+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.723185116 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed

level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
level=ERROR source=sched.go:325 msg="finished request signal received after model unloaded" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
```

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the gpu, bug, nvidia labels 2026-04-29 00:54:10 -05:00

@rick-github commented on GitHub (Mar 10, 2025):

> Is there any setting to force only one model to be loaded?

Set [`OLLAMA_MAX_LOADED_MODELS=1`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests:~:text=OLLAMA_MAX_LOADED_MODELS) in the server environment.

However, ollama should automatically unload models when space is tight and a model is being loaded. Your screenshot shows four runners, and if you are loading only one model, it could be that the server is crashing and orphaning the runners. The server log should show what's happening.
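
On a systemd-managed Linux install, a common way to set that variable is a drop-in override for the ollama unit. The snippet below is only a sketch of that approach; the drop-in path is the usual `systemctl edit` location and the second variable is an optional extra, not something suggested in this thread:

```
# Hypothetical drop-in created with `sudo systemctl edit ollama`,
# typically stored as /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Optional: also cap the number of parallel request slots per model
# Environment="OLLAMA_NUM_PARALLEL=1"
```

After saving, reload systemd and restart ollama (`sudo systemctl daemon-reload && sudo systemctl restart ollama`) so the variable takes effect.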


@tendermonster commented on GitHub (Mar 10, 2025):

> > Is there any setting to force only one model to be loaded?
>
> Set `OLLAMA_MAX_LOADED_MODELS=1` in the server environment.
>
> However, ollama should automatically unload models when space is tight and a model is being loaded. Your screenshot shows four runners, and if you are loading only one model, it could be that the server is crashing and orphaning the runners. The server log should show what's happening.

I assume that setting the OLLAMA_MAX_LOADED_MODELS variable would not really solve the problem; it would only limit loading N models per GPU. With that, should the model not be properly offloaded from the GPU, ollama might just load other models on the CPU. I will go through the logs and let you know if I find the culprit. Otherwise I might publish the full log output if I'm unable to pinpoint the cause.


@tendermonster commented on GitHub (Mar 10, 2025):

Could a closed connection be the main reason for such behavior? If so, is it somehow preventable, so that if the connection is closed prematurely for any reason the model still offloads properly?

This is basically the recurring theme:

```
Feb 28 01:25:01  ollama[1794]: time=2025-02-28T01:25:01.793+01:00 level=WARN source=server.go:562 **msg="client connection closed before server finished loading, aborting load"**
Feb 28 01:25:01  ollama[1794]: time=2025-02-28T01:25:01.793+01:00 level=ERROR source=sched.go:455 **msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"**
Feb 28 01:25:01  ollama[1794]: [GIN] 2025/02/28 - 01:25:01 | 499 |  6.999396951s |       127.0.0.1 | POST     "/api/generate"
Feb 28 01:25:06  ollama[1794]: time=2025-02-28T01:25:06.440+01:00 level=WARN source=types.go:512 msg="invalid option provided" option=tfs_z
Feb 28 01:25:06  ollama[1794]: time=2025-02-28T01:25:06.440+01:00 level=WARN source=types.go:512 msg="invalid option provided" option=num_gqa
Feb 28 01:25:06  ollama[1794]: time=2025-02-28T01:25:06.808+01:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed gpu=GPU-2f1ff6aa-d9f5-778a-0e93-fafb4a0333f9 parallel=4 available=20953956352 required="10.8 GiB"
Feb 28 01:25:06  ollama[1794]: time=2025-02-28T01:25:06.952+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.158715694 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.171+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.377998162 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.278+01:00 level=WARN source=types.go:512 msg="invalid option provided" option=num_gqa
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.278+01:00 level=WARN source=types.go:512 msg="invalid option provided" option=tfs_z
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.391+01:00 level=INFO source=server.go:104 msg="system memory" total="251.4 GiB" free="243.8 GiB" free_swap="980.0 MiB"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.392+01:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[19.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.8 GiB" memory.required.partial="10.8 GiB" memory.required.kv="1.5 GiB" memory.required.allocations="[10.8 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.3 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.393+01:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --port 45527"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.393+01:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.393+01:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.394+01:00 level=WARN source=server.go:562 msg="client connection closed before server finished loading, aborting load"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.394+01:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
Feb 28 01:25:07  ollama[1794]: [GIN] 2025/02/28 - 01:25:07 | 499 |  993.695663ms |       127.0.0.1 | POST     "/api/generate"
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.466+01:00 level=INFO source=runner.go:936 msg="starting go runner"
Feb 28 01:25:07  ollama[1794]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Feb 28 01:25:07  ollama[1794]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Feb 28 01:25:07  ollama[1794]: ggml_cuda_init: found 1 CUDA devices:
Feb 28 01:25:07  ollama[1794]:   Device 0: NVIDIA RTX A4500, compute capability 8.6, VMM: yes
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.486+01:00 level=INFO source=runner.go:937 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=18
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.487+01:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:45527"
Feb 28 01:25:07  ollama[1794]: llama_load_model_from_file: using device CUDA0 (NVIDIA RTX A4500) - 19983 MiB free
Feb 28 01:25:07  ollama[1794]: time=2025-02-28T01:25:07.653+01:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.860210936 model=/usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed
Feb 28 01:25:07  ollama[1794]: llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed (version GGUF V3 (latest))
Feb 28 01:25:07  ollama[1794]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
```

As it seems that connection errors are the problem, and I'm running ollama behind nginx, could it be that I'm missing some configuration?

This is the current nginx config for Tabby:

```
location / {
    proxy_pass http://localhost:8080;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}
```

Could it be that websocket support is missing?

```
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
```
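
For reference, a minimal sketch of what a websocket-capable location block in front of ollama could look like, with longer upstream timeouts so nginx does not drop the connection while a model is still loading. The port (ollama's default 11434), paths and timeout values here are assumptions for illustration, not taken from this setup:

```
location / {
    # assumed ollama backend on its default port
    proxy_pass http://localhost:11434;

    # websocket / streaming support
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;

    # give slow model loads time to finish instead of closing the upstream connection
    proxy_connect_timeout 60s;
    proxy_send_timeout 600s;
    proxy_read_timeout 600s;

    # stream generated tokens instead of buffering the whole response
    proxy_buffering off;
}
```

Whether the disconnects actually come from nginx would still need to be confirmed in the nginx access/error logs.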

@rick-github commented on GitHub (Mar 10, 2025):

> Could a closed connection be the main reason for such behavior?

Unlikely. The log snippet shows the client disconnecting before the model is ready; ollama will just discard the model load, so it wouldn't lead to multiple runners. The nginx logs may show why the ollama client is terminating early, either because of an nginx timeout or an nginx client timeout.


@tendermonster commented on GitHub (Mar 10, 2025):

Hmm, the runners do seem to break somehow. If you have any ideas of what I could debug next, let me know; as of now I'm out of ideas.


@rick-github commented on GitHub (Mar 10, 2025):

You could just add the server log.


@tendermonster commented on GitHub (Mar 10, 2025):

OK, if you see anything let me know.

[ollama.log](https://github.com/user-attachments/files/19169441/ollama.log)


@rick-github commented on GitHub (Mar 10, 2025):

When it gets stuck, what's the output of `ollama ps`?


@tendermonster commented on GitHub (Mar 10, 2025):

See the screenshot above. It seems that by the time I took the screenshot the model was loaded on the CPU. The main suspect is the coder model. This is also indicated in the logs, as it mostly comes up when the connection is closed faster than ollama can load the model.


@rick-github commented on GitHub (Mar 10, 2025):

Well, it's not the server that's creating orphan processes. Over the life of a single server, the model goes from fitting on one GPU, to being split across two GPUs, then having fewer and fewer layers offloaded, num_parallel reduced, and then eventually nothing fits on the GPU and the model is loaded into CPU. And all in the space of 44 seconds.

```
Mär 10 16:26:53 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --port 36189"
Mär 10 16:26:58 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --port 43987"
Mär 10 16:27:11 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --tensor-split 25,24 --port 33571"
Mär 10 16:27:15 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --tensor-split 25,24 --port 35803"
Mär 10 16:27:19 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 18 --parallel 4 --tensor-split 25,24 --port 43191"
Mär 10 16:27:23 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --n-gpu-layers 18 --threads 18 --parallel 1 --tensor-split 12,6 --port 40737"
Mär 10 16:27:27 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --n-gpu-layers 18 --threads 18 --parallel 1 --tensor-split 12,6 --port 45649"
Mär 10 16:27:31 ollama[2186]: cmd="ollama/runners/cuda_v12_avx/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --n-gpu-layers 18 --threads 18 --parallel 1 --tensor-split 12,6 --port 42417"
Mär 10 16:27:35 ollama[2186]: cmd="ollama/runners/cpu_avx2/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --threads 18 --no-mmap --parallel 1 --port 36049"
Mär 10 16:27:37 ollama[2186]: cmd="ollama/runners/cpu_avx2/ollama_llama_server runner  --ctx-size 2048 --batch-size 512 --threads 18 --no-mmap --parallel 1 --port 38303"
```

There may be some race condition in the runner handler that results in the runner staying active even when the handler thinks it's been terminated due to the client disconnecting, but I've been unable to replicate so far. Apart from `tfs_z` and `num_gqa`, does the client set any other options?


@tendermonster commented on GitHub (Mar 10, 2025):

These would be the options for tabby and openweb-ui. I'm not setting any custom options when loading the model.
I'm not surprised that some options may be incompatible. Should the same thing happen again, I'll send the debug log with OLLAMA_DEBUG=1 for more info.

Using:

- tabby 0.25.0
- openweb-ui 0.5.20
- [tabby uses ollama-rs with](https://github.com/TabbyML/tabby/blob/bb91366019565b4de0f4c112867ab752d539e2af/crates/ollama-api-bindings/src/completion.rs#L22)
- [full list of ollama-rs options](https://github.com/pepperoni21/ollama-rs/blob/19e30178f40dabd922aad1f80955464f4bb2a67a/ollama-rs/src/generation/options.rs#L5)
- [openweb-ui options](https://github.com/open-webui/open-webui/blob/d7bfa395b0672a21a41fb6706a4275673d339762/backend/open_webui/utils/payload.py#L83)


@tendermonster commented on GitHub (Mar 12, 2025):

So this time I got a log with debug output from when the model is not unloaded properly.

The trouble starts from Mär 11 13:00:16.

Is this output sufficient to identify the main cause?

[ollama_debug_cut.log](https://github.com/user-attachments/files/19212996/ollama_debug_cut.log)


@tendermonster commented on GitHub (Mar 17, 2025):

As a temporary workaround, I'm using a cronjob to restart ollama when the problem occurs:

```python
import subprocess


def is_ollama_in_gpu_memory():
    """Return True if an ollama runner process still shows up in nvidia-smi."""
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        return "ollama_llama_server" in result.stdout
    except Exception as e:
        print(f"Error checking GPU memory: {e}")
        return False


def is_ollama_ps_empty():
    """Return True if `ollama ps` prints only its header line, i.e. no models are loaded."""
    try:
        result = subprocess.run(["ollama", "ps"], capture_output=True, text=True)
        return "NAME    ID    SIZE    PROCESSOR    UNTIL \n" == result.stdout
    except Exception as e:
        print(f"Error checking ollama models: {e}")
        return True


def restart_ollama_service():
    """Restart the systemd service to free the orphaned GPU memory."""
    try:
        subprocess.run(["sudo", "systemctl", "restart", "ollama"], check=True)
    except Exception as e:
        print(f"Failed to restart ollama service: {e}")


if __name__ == "__main__":
    if is_ollama_in_gpu_memory() and is_ollama_ps_empty():
        print(
            "Detected `ollama_llama_server` in GPU memory but no models loaded. Restarting service..."
        )
        restart_ollama_service()
    else:
        print("Ollama status is normal. No action needed.")
```
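
One way to schedule it (the interval, script path and log path below are assumptions, not part of the original comment) is a root crontab entry along these lines:

```
# Hypothetical crontab entry (installed with `sudo crontab -e`), assuming the
# script above is saved as /usr/local/bin/ollama_watchdog.py -- check every 5 minutes
*/5 * * * * /usr/bin/python3 /usr/local/bin/ollama_watchdog.py >> /var/log/ollama_watchdog.log 2>&1
```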
Reference: github-starred/ollama#52793