[GH-ISSUE #11656] Custom ollama serving #33465

Closed
opened 2026-04-22 16:09:11 -05:00 by GiteaMirror · 7 comments

Originally created by @HungLe2511 on GitHub (Aug 4, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11656

I'm trying to change the Ollama serving configuration because I want to use Ollama in production. I can't find documentation detailing the serving configuration options, but when I checked the system log I collected a list of configuration variables. Details below:

[Screenshot: list of OLLAMA_* configuration variables printed in the server log]

Can you share the full configuration reference or documentation for this? I can't read Go code, so I can't work it out from the source.
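For context, the variables in that screenshot are the OLLAMA_* environment variables the server reads at startup. A minimal sketch of a `.env` file using names that commonly appear in the server's startup log; the exact names, defaults, and availability depend on the Ollama version, so treat these as assumptions to verify against your own log:

```
# Hypothetical .env for the Ollama container; values are illustrative only.
OLLAMA_HOST=0.0.0.0:11434        # listen address inside the container
OLLAMA_MODELS=/models            # where model blobs are stored
OLLAMA_KEEP_ALIVE=5m             # how long a loaded model stays in memory
OLLAMA_NUM_PARALLEL=1            # concurrent requests per loaded model
OLLAMA_MAX_LOADED_MODELS=1       # models kept loaded at once
OLLAMA_MAX_QUEUE=512             # queued requests before the server rejects new ones
OLLAMA_DEBUG=1                   # more verbose logging
```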


@HungLe2511 commented on GitHub (Aug 4, 2025):

I use this config with Docker, like this:

root@10-0-0-13:~/models/ollama/qwen3-30b-0# cat docker-compose.yml
services:
  ollama_gpu0:
    image: ollama/ollama
    container_name: v03-ollama-gpu0
    env_file: .env
    ports:
      - "11436:11434"
    environment:
      - OLLAMA_MODELS=/models
    volumes:
      - qwen3-30b_ollama_models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: ["gpu"]

volumes:
  qwen3-30b_ollama_models:
    external: true

Is that right?
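One way to sanity-check a setup like this, assuming the container name and published port above (standard Docker and curl commands, nothing Ollama-specific):

```
# confirm the container can see GPU 0
docker exec v03-ollama-gpu0 nvidia-smi

# confirm the API answers on the published port
curl http://localhost:11436/api/version
```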


@rick-github commented on GitHub (Aug 6, 2025):

It's not clear what your problem is. Can you explain it a bit more?


@HungLe2511 commented on GitHub (Aug 6, 2025):

My Ollama instance is not working properly. After a period of use (I don't yet have specific throughput statistics), it takes a long time to respond, or requests go missing (I'm testing with curl and Postman).
The information above is what I collected from the Ollama log while it was running, not from any official source. I understand that Go builds everything into a single binary, so even if I set those environment variables, I'm not certain the binary actually uses them. Of the variables in the list above, some seem to take effect and others do not.
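To see which of those variables the server actually picked up, the startup log and the container environment can be checked directly (a sketch, assuming the container name from the compose file above; Ollama typically prints its effective configuration near the top of the server log):

```
# effective server configuration is logged shortly after startup
docker logs v03-ollama-gpu0 2>&1 | head -n 30

# the environment the server process was started with
docker exec v03-ollama-gpu0 env | grep OLLAMA
```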


@rick-github commented on GitHub (Aug 6, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
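For a Docker deployment like this one, the server log can be pulled from the container, and adding OLLAMA_DEBUG=1 to the environment makes it more verbose (a sketch, assuming the container and service names above):

```
# save the full server log covering the time window when requests stalled
docker logs v03-ollama-gpu0 > ollama.log 2>&1

# for more detail, add to the compose environment and recreate the container:
#   environment:
#     - OLLAMA_DEBUG=1
docker compose up -d --force-recreate ollama_gpu0
```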


@HungLe2511 commented on GitHub (Aug 7, 2025):

load_tensors: loading model tensors, this can take a while... (mmap = true)
time=2025-08-07T07:00:02.752Z level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.502832294 runner.size="19.8 GiB" runner.vram="19.8 GiB" runner.parallel=1 runner.pid=2300029 runner.model=/models/blobs/sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
load_tensors: CPU_Mapped model buffer size = 17754.15 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 32
llama_kv_cache_unified: CPU KV buffer size = 1536.00 MiB
llama_kv_cache_unified: KV self size = 1536.00 MiB, K (f16): 768.00 MiB, V (f16): 768.00 MiB
llama_context: CPU compute buffer size = 1080.01 MiB
llama_context: graph nodes = 3126
llama_context: graph splits = 1
time=2025-08-07T07:00:05.200Z level=INFO source=server.go:630 msg="llama runner started in 2.76 seconds"
[GIN] 2025/08/07 - 07:00:27 | 200 | 42.486982457s | 172.20.0.1 | POST "/api/chat"
cuda driver library failed to get device context 800time=2025-08-07T07:00:27.794Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:28.047Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:28.296Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:28.547Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:28.797Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:29.047Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:29.296Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:29.547Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:29.797Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:30.047Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:30.297Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:30.547Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:30.796Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:31.047Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:31.296Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:31.546Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:31.797Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:32.046Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:32.297Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:32.547Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
time=2025-08-07T07:00:32.794Z level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.002852406 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=1 runner.pid=2304282 runner.model=/models/blobs/sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac
cuda driver library failed to get device context 800time=2025-08-07T07:00:32.797Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2025-08-07T07:00:32.799Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"
time=2025-08-07T07:00:32.845Z level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=/models/blobs/sha256-e9183b5c18a0cf736578c1e3d1cbd4b7e98e3ad3be6176b68c20f156d54a07ac library=cuda parallel=1 required="19.8 GiB"
cuda driver library failed to get device context 800time=2025-08-07T07:00:32.848Z level=WARN source=gpu.go:434 msg="error looking up nvidia GPU memory"


@HungLe2511 commented on GitHub (Aug 7, 2025):

Please help me fix that.


@rick-github commented on GitHub (Aug 7, 2025):

cuda driver library failed to get device context 800

CUDA error code 800 is CUDA_ERROR_NOT_PERMITTED (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html). Does restarting the container restore operation, or do you have to do something else (e.g., reboot or run some nvidia command)? Is there anything in the system logs (dmesg, /var/log/syslog, /var/log/kern.log, etc.) that indicates anything unusual with the nvidia devices?

Also see https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#linux-docker and https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#linux-nvidia-troubleshooting
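A few host-side checks that correspond to those questions (a sketch; log paths and grep patterns are the usual ones, adjust for your distribution):

```
# does restarting the container alone restore GPU access?
docker restart v03-ollama-gpu0

# driver and GPU state on the host
nvidia-smi

# look for NVIDIA or Xid errors in the kernel log
dmesg | grep -iE 'nvrm|xid|nvidia'
```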


Reference: github-starred/ollama#33465