[GH-ISSUE #10671] 6K context, only through API: error loading llama server "timed out waiting for llama runner to start: context canceled" #7016

Closed
opened 2026-04-12 18:54:39 -05:00 by GiteaMirror · 2 comments

Originally created by @j2l on GitHub (May 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10671

What is the issue?

For DSR1 14B and Gemma3 12B on Ubuntu 22 with an RTX 3060 (12 GB VRAM, 24 GB RAM), using a 6K-token context:

  • Using ollama run, both models can ingest the prompt and reply correctly.
  • Using the API, both fail with the error: error loading llama server "timed out waiting for llama runner to start: context canceled" (a minimal request sketch follows below).

This seems to me to be a loading timeout issue.
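For context, a minimal sketch of the kind of API request involved (assumed client code; the issue does not show the actual caller, and the model tag, prompt, and num_ctx value are illustrative):

```typescript
// Hypothetical non-streaming call to Ollama's /api/generate endpoint.
// While the model is still loading, no bytes come back; if the client
// gives up during that window, Ollama aborts the load (see the log below).
async function generate(): Promise<void> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "gemma3:12b",            // assumed model tag
      prompt: "…a ~6K-token prompt…", // long input from the report
      stream: false,                  // wait for the complete answer
      options: { num_ctx: 6144 },     // requested 6K context window
    }),
  });
  console.log((await res.json()).response);
}
```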

Relevant log output

time=2025-05-12T10:42:14.669Z level=INFO source=sched.go:517 msg="updated VRAM based on existing loaded models" gpu=GPU-b04b66c1-d5eb-66df-899a-6e62a122d319 library=cuda total="11.7 GiB" available="616.9 MiB"

time=2025-05-12T10:42:15.447Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32

time=2025-05-12T10:42:15.485Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32

time=2025-05-12T10:42:15.636Z level=INFO source=server.go:106 msg="system memory" total="23.3 GiB" free="16.8 GiB" free_swap="12.9 GiB"

time=2025-05-12T10:42:15.637Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=48 layers.split="" memory.available="[11.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.4 GiB" memory.required.partial="10.6 GiB" memory.required.kv="992.0 MiB" memory.required.allocations="[10.6 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="519.5 MiB" memory.graph.partial="1.3 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"

time=2025-05-12T10:42:15.708Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32

time=2025-05-12T10:42:15.709Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false

time=2025-05-12T10:42:15.714Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07

time=2025-05-12T10:42:15.714Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000

time=2025-05-12T10:42:15.714Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06

time=2025-05-12T10:42:15.714Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1

time=2025-05-12T10:42:15.714Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256

time=2025-05-12T10:42:15.714Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3 --ctx-size 8192 --batch-size 512 --n-gpu-layers 48 --threads 6 --parallel 1 --port 44559"

time=2025-05-12T10:42:15.715Z level=INFO source=sched.go:452 msg="loaded runners" count=1

time=2025-05-12T10:42:15.715Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding"

time=2025-05-12T10:42:15.715Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"

time=2025-05-12T10:42:15.724Z level=INFO source=runner.go:851 msg="starting ollama engine"

time=2025-05-12T10:42:15.724Z level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:44559"

time=2025-05-12T10:42:15.792Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32

time=2025-05-12T10:42:15.793Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""

time=2025-05-12T10:42:15.793Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""

time=2025-05-12T10:42:15.793Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=36

load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no

ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no

ggml_cuda_init: found 1 CUDA devices:

  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes

load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so

time=2025-05-12T10:42:15.847Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)

time=2025-05-12T10:42:15.939Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="2.3 GiB"

time=2025-05-12T10:42:15.939Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="6.0 GiB"

time=2025-05-12T10:42:15.966Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"

time=2025-05-12T10:42:29.765Z level=WARN source=server.go:596 msg="client connection closed before server finished loading, aborting load"

time=2025-05-12T10:42:29.765Z level=ERROR source=sched.go:458 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"

[GIN] 2025/05/12 - 10:42:29 | 499 | 15.422711327s |      172.17.0.1 | POST     "/api/generate"

time=2025-05-12T10:42:34.924Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.158672739 model=/root/.ollama/models/blobs/sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3

time=2025-05-12T10:42:35.174Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.408708976 model=/root/.ollama/models/blobs/sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3

time=2025-05-12T10:42:35.424Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.6590568359999995 model=/root/.ollama/models/blobs/sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3

OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.6.8

GiteaMirror added the bug label 2026-04-12 18:54:39 -05:00

@rick-github commented on GitHub (May 12, 2025):

time=2025-05-12T10:42:29.765Z level=WARN source=server.go:596 msg="client connection closed before server finished loading, aborting load"

Client has a ~15 second timeout.
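One way to avoid the client-side cutoff is to give the request an explicit, generous abort budget. A minimal fetch-based sketch (the function name and the 5-minute budget are assumptions, not from the thread):

```typescript
// Sketch: wrap fetch with a watchdog long enough to cover model load time.
// The 300 s budget is an illustrative assumption; tune it to your hardware.
async function generateWithTimeout(body: unknown, timeoutMs = 300_000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
      signal: controller.signal, // request is only aborted after timeoutMs
    });
    return await res.json();
  } finally {
    clearTimeout(timer); // stop the watchdog whether we succeeded or failed
  }
}
```

Pre-loading the model first (a request containing only the model name keeps it resident) is another common workaround, since subsequent calls skip the load entirely.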


@j2l commented on GitHub (May 12, 2025):

@rick-github thank you!
I have no idea how to change it in Svelte 3. Well, I'll have to dig.


Reference: github-starred/ollama#7016