[GH-ISSUE #12006] GPU mysteriously lost while running the gpt-oss model in an ollama container #54485

Closed
opened 2026-04-29 06:07:03 -05:00 by GiteaMirror · 1 comment

Originally created by @main1015 on GitHub (Aug 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12006

What is the issue?

While running the gpt-oss:20b model in the ollama container, the GPU mysteriously disappeared. Running the following inside the container:

root@ad006a655920:/# nvidia-smi
Failed to initialize NVML: Unknown Error
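
For comparison, a minimal host-side check for this situation might look like the following (the container name `ollama` is an assumption; substitute the actual name or ID from `docker ps`):

```shell
# On the host: is the RTX A5000 still visible to the driver?
nvidia-smi

# Inside the container: this is where the NVML "Unknown Error" appears
docker exec -it ollama nvidia-smi

# Restarting the container often restores GPU visibility until the problem recurs
docker restart ollama && docker exec -it ollama nvidia-smi
```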

My ollama container image is:

(base) wy@myth:~$ docker images |grep ollama
ollama/ollama                                                       latest          53f18253db46   2 weeks ago     2.28GB

What is going on here?
Below are the logs from my ollama container:

load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
time=2025-08-08T02:05:57.655Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-08T02:05:57.742Z level=INFO source=ggml.go:367 msg="offloading 24 repeating layers to GPU"
time=2025-08-08T02:05:57.742Z level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
time=2025-08-08T02:05:57.742Z level=INFO source=ggml.go:378 msg="offloaded 24/25 layers to GPU"
time=2025-08-08T02:05:57.742Z level=INFO source=ggml.go:381 msg="model weights" buffer=CUDA0 size="10.7 GiB"
time=2025-08-08T02:05:57.742Z level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="2.2 GiB"
time=2025-08-08T02:05:57.750Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="8.1 GiB"
time=2025-08-08T02:05:57.750Z level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-08T02:05:57.801Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-08T02:06:00.814Z level=INFO source=server.go:637 msg="llama runner started in 3.26 seconds"
[GIN] 2025/08/08 - 02:06:12 | 200 |      22.763µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:06:12 | 200 |      25.861µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:07:14 | 200 |         1m18s |      172.17.0.1 | POST     "/v1/chat/completions"
[GIN] 2025/08/08 - 02:07:46 | 200 |      21.995µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:07:46 | 200 |      23.404µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:07:47 | 200 |      35.693µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:07:47 | 200 |       28.29µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:09:41 | 200 |      41.661µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:09:41 | 200 |      73.833µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:15:40 | 404 |     948.688µs |      172.17.0.1 | POST     "/api/generate"
[GIN] 2025/08/08 - 02:15:42 | 404 |     795.312µs |      172.17.0.1 | POST     "/api/generate"
time=2025-08-08T02:15:42.418Z level=INFO source=server.go:135 msg="system memory" total="125.5 GiB" free="101.6 GiB" free_swap="2.0 GiB"
time=2025-08-08T02:15:42.418Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[21.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="15.1 GiB" memory.required.partial="0 B" memory.required.kv="3.4 GiB" memory.required.allocations="[0 B]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="32.0 GiB" memory.graph.partial="32.0 GiB"
time=2025-08-08T02:15:42.459Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 131072 --batch-size 512 --threads 8 --no-mmap --parallel 4 --port 34903"
time=2025-08-08T02:15:42.459Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-08T02:15:42.459Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-08T02:15:42.459Z level=WARN source=server.go:605 msg="client connection closed before server finished loading, aborting load"
time=2025-08-08T02:15:42.459Z level=ERROR source=sched.go:487 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
[GIN] 2025/08/08 - 02:15:42 | 499 |  1.182658609s |      172.17.0.1 | POST     "/api/generate"
time=2025-08-08T02:15:42.468Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-08T02:15:42.469Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:34903"
time=2025-08-08T02:15:42.512Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
time=2025-08-08T02:15:42.598Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-08T02:15:43.353Z level=INFO source=server.go:135 msg="system memory" total="125.5 GiB" free="101.6 GiB" free_swap="2.0 GiB"
time=2025-08-08T02:15:43.354Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[21.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="15.1 GiB" memory.required.partial="0 B" memory.required.kv="3.4 GiB" memory.required.allocations="[0 B]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="32.0 GiB" memory.graph.partial="32.0 GiB"
time=2025-08-08T02:15:43.397Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 131072 --batch-size 512 --threads 8 --no-mmap --parallel 4 --port 42821"
time=2025-08-08T02:15:43.397Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-08T02:15:43.397Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-08T02:15:43.397Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-08T02:15:43.406Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-08T02:15:43.406Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:42821"
time=2025-08-08T02:15:43.449Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
time=2025-08-08T02:15:43.501Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-08T02:15:43.588Z level=INFO source=ggml.go:367 msg="offloading 0 repeating layers to GPU"
time=2025-08-08T02:15:43.588Z level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
time=2025-08-08T02:15:43.588Z level=INFO source=ggml.go:378 msg="offloaded 0/25 layers to GPU"
time=2025-08-08T02:15:43.588Z level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="12.8 GiB"
time=2025-08-08T02:15:43.648Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-08T02:15:44.647Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-08T02:15:44.647Z level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="32.0 GiB"
time=2025-08-08T02:15:45.909Z level=INFO source=server.go:637 msg="llama runner started in 2.51 seconds"
[GIN] 2025/08/08 - 02:15:46 | 404 |     842.099µs |      172.17.0.1 | POST     "/api/generate"
[GIN] 2025/08/08 - 02:15:46 | 404 |     835.145µs |      172.17.0.1 | POST     "/api/generate"
time=2025-08-08T02:15:46.720Z level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:42821/completion\": context canceled"
[GIN] 2025/08/08 - 02:15:46 | 200 |  4.337180182s |      172.17.0.1 | POST     "/api/generate"
[GIN] 2025/08/08 - 02:15:48 | 404 |     797.596µs |      172.17.0.1 | POST     "/api/generate"
[GIN] 2025/08/08 - 02:15:48 | 404 |     821.356µs |      172.17.0.1 | POST     "/api/generate"
time=2025-08-08T02:15:49.021Z level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:42821/completion\": context canceled"
[GIN] 2025/08/08 - 02:15:49 | 200 |  2.272718522s |      172.17.0.1 | POST     "/api/generate"
[GIN] 2025/08/08 - 02:15:49 | 404 |     820.049µs |      172.17.0.1 | POST     "/api/generate"
time=2025-08-08T02:15:49.657Z level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:42821/completion\": context canceled"
[GIN] 2025/08/08 - 02:15:49 | 200 |  606.929414ms |      172.17.0.1 | POST     "/api/generate"
[GIN] 2025/08/08 - 02:15:52 | 404 |     783.625µs |      172.17.0.1 | POST     "/api/generate"
time=2025-08-08T02:15:53.306Z level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:42821/completion\": context canceled"
[GIN] 2025/08/08 - 02:15:53 | 200 |  3.622453821s |      172.17.0.1 | POST     "/api/generate"
[GIN] 2025/08/08 - 02:15:53 | 200 |  101.029513ms |      172.17.0.1 | POST     "/api/show"
[GIN] 2025/08/08 - 02:15:54 | 200 |   78.520838ms |      172.17.0.1 | POST     "/api/show"
[GIN] 2025/08/08 - 02:15:55 | 200 |  100.599109ms |      172.17.0.1 | POST     "/api/show"
[GIN] 2025/08/08 - 02:15:55 | 200 |    99.47526ms |      172.17.0.1 | POST     "/api/show"
[GIN] 2025/08/08 - 02:15:55 | 200 |  101.515257ms |      172.17.0.1 | POST     "/api/show"
[GIN] 2025/08/08 - 02:16:40 | 200 |      27.406µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:16:40 | 200 |      30.508µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:17:05 | 200 |       22.33µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:17:05 | 200 |      21.118µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:17:47 | 200 |      24.005µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:17:47 | 200 |      27.324µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:17:48 | 200 |      24.076µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:17:48 | 200 |      24.453µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:18:36 | 200 |      31.943µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:18:36 | 200 |      39.346µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:18:37 | 200 |       23.75µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:18:37 | 200 |      34.403µs |       127.0.0.1 | GET      "/api/ps"
time=2025-08-08T02:18:39.424Z level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:42821/completion\": context canceled"
time=2025-08-08T02:18:39.424Z level=INFO source=runner.go:646 msg="aborting completion request due to client closing the connection"
[GIN] 2025/08/08 - 02:18:39 | 200 |         2m46s |      172.17.0.1 | POST     "/api/generate"
[GIN] 2025/08/08 - 02:19:15 | 200 |      28.224µs |      172.17.0.1 | GET      "/"
[GIN] 2025/08/08 - 02:19:15 | 404 |       5.403µs |      172.17.0.1 | GET      "/favicon.ico"
[GIN] 2025/08/08 - 02:19:37 | 200 |      21.216µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:19:37 | 200 |      23.773µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/08 - 02:19:37 | 200 |      22.829µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:19:37 | 200 |      22.603µs |       127.0.0.1 | GET      "/api/ps"
time=2025-08-08T02:21:19.309Z level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:42821/completion\": context canceled"
time=2025-08-08T02:21:19.309Z level=INFO source=runner.go:646 msg="aborting completion request due to client closing the connection"
[GIN] 2025/08/08 - 02:21:19 | 200 |         1m48s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/08 - 02:21:42 | 200 |      21.686µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:21:42 | 200 |      23.775µs |       127.0.0.1 | GET      "/api/ps"
time=2025-08-08T02:21:48.022Z level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:42821/completion\": context canceled"
time=2025-08-08T02:21:48.022Z level=INFO source=runner.go:646 msg="aborting completion request due to client closing the connection"
[GIN] 2025/08/08 - 02:21:48 | 200 | 17.287678521s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/08 - 02:23:22 | 200 |      21.187µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/08 - 02:23:22 | 200 |    1.270011ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/08/08 - 02:23:32 | 404 |     860.202µs |      172.17.0.1 | POST     "/api/generate"
[GIN] 2025/08/08 - 02:23:34 | 404 |     874.441µs |      172.17.0.1 | POST     "/api/generate"
time=2025-08-08T02:23:35.027Z level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:42821/completion\": context canceled"
time=2025-08-08T02:23:35.027Z level=INFO source=runner.go:646 msg="aborting completion request due to client closing the connection"
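
The GPU loss is visible in the log itself: the first load at 02:05 offloads 24/25 layers to CUDA0, while the reload at 02:15 offloads 0/25 layers and falls back to CPU. A quick way to pull out just those decisions from the container log (container name `ollama` is again an assumption):

```shell
# Filter the container log down to the layer-offload decisions:
# 24/25 on the first load vs 0/25 after the GPU stopped being usable
docker logs ollama 2>&1 | grep "offloaded .*/25 layers"
```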

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-29 06:07:03 -05:00
@rick-github commented on GitHub (Aug 21, 2025): https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#linux-docker
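
The linked "linux docker" section covers this symptom (the GPU works at first, then NVML reports "Unknown Error" inside the container). A hedged sketch of the host-side workaround commonly associated with this failure mode, which is what that section points at as far as I can tell: switch Docker's cgroup driver to cgroupfs (the default daemon.json path and the container name are assumptions):

```shell
# On the host, add (or merge) this key into /etc/docker/daemon.json:
#   "exec-opts": ["native.cgroupdriver=cgroupfs"]
sudo systemctl restart docker       # apply the new cgroup driver
docker restart ollama               # container name is an assumption
docker exec -it ollama nvidia-smi   # check whether the GPU is visible again
```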
Reference: github-starred/ollama#54485