[GH-ISSUE #13250] Cannot load model even though I have enough VRAM #55271

Closed
opened 2026-04-29 08:41:25 -05:00 by GiteaMirror · 11 comments

Originally created by @YetheSamartaka on GitHub (Nov 26, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13250

What is the issue?

I have two instances of Ollama running in Docker, one on port 11434 and the other on 11439. I have 3x RTX 4090, where the first is at 18/24 GB VRAM and the other two at 17/24 GB each. I want to load another model in the second instance, which will take 5.2 GB of VRAM (plenty of room to fit, since I have OLLAMA_SCHED_SPREAD enabled). The second Ollama instance does not have any models loaded. For both Docker containers, I'm using these env variables:

-e OLLAMA_MAX_LOADED_MODELS=3 -e OLLAMA_KEEP_ALIVE=-1 -e OLLAMA_SCHED_SPREAD=1 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_KV_CACHE_TYPE=q8_0 -e OLLAMA_LOAD_TIMEOUT=60m -e OLLAMA_ORIGINS="*"
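
For reference, a minimal sketch of how the second container might be started with these flags. Only the -e flags come from this report; the container name, volume, and port mapping are assumptions:

```shell
# Hypothetical launch of the second Ollama instance (host port 11439 -> 11434).
# Container name "ollama2" and volume "ollama2" are assumed for illustration.
docker run -d --gpus=all \
  -e OLLAMA_MAX_LOADED_MODELS=3 -e OLLAMA_KEEP_ALIVE=-1 \
  -e OLLAMA_SCHED_SPREAD=1 -e OLLAMA_FLASH_ATTENTION=1 \
  -e OLLAMA_KV_CACHE_TYPE=q8_0 -e OLLAMA_LOAD_TIMEOUT=60m \
  -e OLLAMA_ORIGINS="*" \
  -v ollama2:/root/.ollama -p 11439:11434 \
  --name ollama2 ollama/ollama
```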

This happens when I try to run the model qwen3-embedding-0.6b-q8_0 using the command:
ollama run qwen3-embedding-0.6b-q8_0:latest "Init"

I get this error:
Error: do load request: Post "http://127.0.0.1:38057/load": EOF

But I cannot load any other models either. I should note that I already have one instance of this model running on the first Ollama Docker instance on port 11434, and it works as expected. And if I free up memory, I can load two of these models across both Docker containers without any issues, and they work for my use case as expected (I then use [Nomyo Router](https://github.com/nomyo-ai/nomyo-router) to serve multiple instances of the same model for load balancing).

It seems that memory estimation is not working properly, even though the model should fit comfortably. More details in the log.
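
A quick way to cross-check the scheduler's estimates against what the driver actually reports, assuming nvidia-smi is available on the host or inside the containers:

```shell
# Per-GPU memory as seen by the NVIDIA driver; compare with the
# "gpu memory" lines in the Ollama log below.
nvidia-smi --query-gpu=index,name,memory.used,memory.free,memory.total --format=csv
```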

Relevant log output

time=2025-11-26T07:59:53.433Z level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 35507"
time=2025-11-26T07:59:56.436Z level=INFO source=runner.go:449 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v12]" extra_envs=map[] error="failed to finish discovery before timeout"
time=2025-11-26T07:59:56.437Z level=WARN source=runner.go:341 msg="unable to refresh free memory, using old values"
time=2025-11-26T08:00:04.012Z level=INFO source=sched.go:443 msg="system memory" total="125.6 GiB" free="125.4 GiB" free_swap="32.0 GiB"
time=2025-11-26T08:00:04.012Z level=INFO source=sched.go:450 msg="gpu memory" id=GPU-46e43120-44ac-f5d2-97b9-b220f8578118 library=CUDA available="8.6 GiB" free="9.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-11-26T08:00:04.012Z level=INFO source=sched.go:450 msg="gpu memory" id=GPU-9ee3a284-3456-57ff-f595-6f2e0b1db136 library=CUDA available="8.0 GiB" free="8.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-11-26T08:00:04.012Z level=INFO source=sched.go:450 msg="gpu memory" id=GPU-0978d7da-f602-d3ba-aedb-b466066378ac library=CUDA available="8.8 GiB" free="9.2 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-11-26T08:00:04.012Z level=INFO source=server.go:702 msg="loading model" "model layers"=29 requested=-1
time=2025-11-26T08:00:04.036Z level=INFO source=runner.go:1398 msg="starting ollama engine"
time=2025-11-26T08:00:04.039Z level=INFO source=runner.go:1433 msg="Server listening on 127.0.0.1:34863"
time=2025-11-26T08:00:04.047Z level=INFO source=runner.go:1271 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:64 GPULayers:29[ID:GPU-9ee3a284-3456-57ff-f595-6f2e0b1db136 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-26T08:00:04.109Z level=INFO source=ggml.go:136 msg="" architecture=qwen3 file_type=Q8_0 name="Qwen3 Embedding 0.6b" description="" num_tensors=310 num_key_values=37
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-46e43120-44ac-f5d2-97b9-b220f8578118
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-9ee3a284-3456-57ff-f595-6f2e0b1db136
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0978d7da-f602-d3ba-aedb-b466066378ac
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-11-26T08:00:04.360Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
CUDA error: out of memory
current device: 1, in function ggml_cuda_set_device at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:101
  cudaSetDevice(device)
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:88: CUDA error
/usr/lib/ollama/libggml-base.so(+0x1a858)[0x73e63c125858]
/usr/lib/ollama/libggml-base.so(ggml_print_backtrace+0x1e6)[0x73e63c125c26]
/usr/lib/ollama/libggml-base.so(ggml_abort+0x11d)[0x73e63c125dad]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x1223a2)[0x73e5b24a63a2]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x1226fe)[0x73e5b24a66fe]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x122fd0)[0x73e5b24a6fd0]
/usr/bin/ollama(+0x10ea0de)[0x5781890ef0de]
/usr/bin/ollama(+0x10ecdd8)[0x5781890f1dd8]
/usr/bin/ollama(+0x1081fbb)[0x578189086fbb]
/usr/bin/ollama(+0x371aa1)[0x578188376aa1]
SIGABRT: abort
PC=0x73e684907b2c m=9 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 147 gp=0xc000525880 m=9 mp=0xc000580808 [syscall]:
runtime.cgocall(0x578189086fa0, 0xc0000490f8)
        runtime/cgocall.go:167 +0x4b fp=0xc0000490d0 sp=0xc000049098 pc=0x57818836bb0b
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_reserve(0x73e6187044f0, 0x73e613f00d20)
        _cgo_gotypes.go:996 +0x47 fp=0xc0000490f8 sp=0xc0000490d0 pc=0x5781887a0467
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve.func2(...)
        github.com/ollama/ollama/ml/backend/ggml/ggml.go:850
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0xc000ece3c0)
        github.com/ollama/ollama/ml/backend/ggml/ggml.go:850 +0x125 fp=0xc000049370 sp=0xc0000490f8 pc=0x5781887ae165
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc00033f0e0, 0x1)
        github.com/ollama/ollama/runner/ollamarunner/runner.go:1162 +0xade fp=0xc0000496a0 sp=0xc000049370 pc=0x578188885c9e
github.com/ollama/ollama/runner/ollamarunner.(*Server).allocModel(0xc00033f0e0, {0x7ffef2403d54?, 0x5781886635fa?}, {0x0, 0x40, {0xc0001b9ac0, 0x1, 0x1}, 0x0}, {0x0, ...}, ...)
        github.com/ollama/ollama/runner/ollamarunner/runner.go:1219 +0x2b1 fp=0xc000049730 sp=0xc0000496a0 pc=0x5781888863d1
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc00033f0e0, {0x578189897140, 0xc0004715e0}, 0xc0005e5cc0)
        github.com/ollama/ollama/runner/ollamarunner/runner.go:1298 +0x54d fp=0xc000049ac0 sp=0xc000049730 pc=0x578188886e0d
github.com/ollama/ollama/runner/ollamarunner.(*Server).load-fm({0x578189897140?, 0xc0004715e0?}, 0xc00065bb40?)
        <autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x5781888891b6
net/http.HandlerFunc.ServeHTTP(0xc0005d8780?, {0x578189897140?, 0xc0004715e0?}, 0xc00065bb60?)
        net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x57818866e2c9
net/http.(*ServeMux).ServeHTTP(0x578188313ce5?, {0x578189897140, 0xc0004715e0}, 0xc0005e5cc0)
        net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x5781886701c4
net/http.serverHandler.ServeHTTP({0x578189893730?}, {0x578189897140?, 0xc0004715e0?}, 0x1?)
        net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x57818868dc4e
net/http.(*conn).serve(0xc0005a06c0, {0x578189899548, 0xc0006131d0})
        net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x57818866c7c5
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x578188672088
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x578188376e21
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3454 +0x485

...

goroutine 146 gp=0xc0005256c0 m=nil [sync.WaitGroup.Wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x60?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00019fa90 sp=0xc00019fa70 pc=0x57818836ef8e
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.semacquire1(0xc00033f198, 0x0, 0x1, 0x0, 0x18)
        runtime/sema.go:188 +0x229 fp=0xc00019faf8 sp=0xc00019fa90 pc=0x57818834ef09
sync.runtime_SemacquireWaitGroup(0x0?)
        runtime/sema.go:110 +0x25 fp=0xc00019fb30 sp=0xc00019faf8 pc=0x5781883708c5
sync.(*WaitGroup).Wait(0xc00033f190?)
        sync/waitgroup.go:118 +0x48 fp=0xc00019fb58 sp=0xc00019fb30 pc=0x578188382768
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc00033f0e0, {0x578189899580, 0xc0005dac30})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:441 +0x45 fp=0xc00019ffb8 sp=0xc00019fb58 pc=0x57818887f725
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap1()
        github.com/ollama/ollama/runner/ollamarunner/runner.go:1411 +0x28 fp=0xc00019ffe0 sp=0xc00019ffb8 pc=0x578188888dc8
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00019ffe8 sp=0xc00019ffe0 pc=0x578188376e21
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
        github.com/ollama/ollama/runner/ollamarunner/runner.go:1411 +0x4c9

goroutine 149 gp=0xc000525c00 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
        runtime/proc.go:435 +0xce fp=0xc00022f5d8 sp=0xc00022f5b8 pc=0x57818836ef8e
runtime.netpollblock(0x578188392638?, 0x883086c6?, 0x81?)
        runtime/netpoll.go:575 +0xf7 fp=0xc00022f610 sp=0xc00022f5d8 pc=0x5781883342b7
internal/poll.runtime_pollWait(0x73e63dd9ed58, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc00022f630 sp=0xc00022f610 pc=0x57818836e1a5
internal/poll.(*pollDesc).wait(0xc0005dd400?, 0xc0006132d1?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00022f658 sp=0xc00022f630 pc=0x5781883f60e7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0005dd400, {0xc0006132d1, 0x1, 0x1})
        internal/poll/fd_unix.go:165 +0x27a fp=0xc00022f6f0 sp=0xc00022f658 pc=0x5781883f73da
net.(*netFD).Read(0xc0005dd400, {0xc0006132d1?, 0x0?, 0x0?})
        net/fd_posix.go:55 +0x25 fp=0xc00022f738 sp=0xc00022f6f0 pc=0x57818846c3e5
net.(*conn).Read(0xc000190ab0, {0xc0006132d1?, 0x0?, 0x0?})
        net/net.go:194 +0x45 fp=0xc00022f780 sp=0xc00022f738 pc=0x57818847a7a5
net/http.(*connReader).backgroundRead(0xc0006132c0)
        net/http/server.go:690 +0x37 fp=0xc00022f7c8 sp=0xc00022f780 pc=0x578188666697
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:686 +0x25 fp=0xc00022f7e0 sp=0xc00022f7c8 pc=0x5781886665c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00022f7e8 sp=0xc00022f7e0 pc=0x578188376e21
created by net/http.(*connReader).startBackgroundRead in goroutine 147
        net/http/server.go:686 +0xb6

rax    0x0
rbx    0xb9
rcx    0x73e684907b2c
rdx    0x6
rdi    0xb1
rsi    0xb9
rbp    0x73e635ffa310
rsp    0x73e635ffa2d0
r8     0x0
r9     0x7
r10    0x8
r11    0x246
r12    0x6
r13    0x73e5b2adaa88
r14    0x16
r15    0xc000616b20
rip    0x73e684907b2c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2025-11-26T08:00:04.512Z level=INFO source=sched.go:470 msg="Load failed" model=/root/.ollama/models/blobs/sha256-06507c7b42688469c4e7298b0a1e16deff06caf291cf0a5b278c308249c3e439 error="do load request: Post \"http://127.0.0.1:34863/load\": EOF"
time=2025-11-26T08:00:04.513Z level=ERROR source=server.go:265 msg="llama runner terminated" error="exit status 2"

OS

WSL2

GPU

Nvidia

CPU

AMD

Ollama version

0.13.0

GiteaMirror added the "bug" and "needs more info" labels 2026-04-29 08:41:25 -05:00

@rick-github commented on GitHub (Nov 26, 2025):

$ ollama pull qwen3-embedding-0.6b-q8_0:latest
pulling manifest 
Error: pull model manifest: file does not exist

Is it the same model as qwen3-embedding:0.6b-q8_0? What changes are in the Modelfile?


@YetheSamartaka commented on GitHub (Nov 26, 2025):

@rick-github
Yes, it is the same model: https://ollama.com/library/qwen3-embedding:0.6b-q8_0. I also have a variant with 16K ctx, with this in the Modelfile:

FROM qwen3-embedding:0.6b-q8_0
PARAMETER num_ctx 16384
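
For context, a sketch of how such a variant would typically be built from that Modelfile (the target tag is assumed to match the one used in the original report):

```shell
# Build the 16K-context variant from the Modelfile above.
ollama create qwen3-embedding-0.6b-q8_0 -f Modelfile
```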

@rick-github commented on GitHub (Nov 26, 2025):

What version of ollama (ollama -v)?


@YetheSamartaka commented on GitHub (Nov 26, 2025):

@rick-github
edit: I mistakenly put 1 instead of 0 at the beginning xD

0.13.0, as stated in the opening message.


@adex345 commented on GitHub (Nov 27, 2025):

I have the same issue, but I use the default port and no Docker. After the update I can finally use Llama 3B with ROCm 7.1 (it was using the CPU before), but Qwen3 30B and Qwen3 Coder 30B don't work:

lis 27 19:25:54 cachyos-x64 ollama[1144524]: time=2025-11-27T19:25:54.665+01:00 level=INFO source=sched.go:470 msg="Load failed" model=/var/lib/ollama/blobs/sha256-78b329e716e7e9775973d392cd132b1f1ff1c8287a992887caeb6fd6c56ba9cc error="do load request: Post \"http://127.0.0.1:34163/load\": EOF"
lis 27 19:25:54 cachyos-x64 ollama[1144524]: time=2025-11-27T19:25:54.665+01:00 level=DEBUG source=server.go:1755 msg="stopping llama server" pid=1144796
lis 27 19:25:54 cachyos-x64 ollama[1144524]: time=2025-11-27T19:25:54.665+01:00 level=DEBUG source=server.go:1761 msg="waiting for llama server to exit" pid=1144796
lis 27 19:25:54 cachyos-x64 ollama[1144524]: time=2025-11-27T19:25:54.666+01:00 level=ERROR source=server.go:265 msg="llama runner terminated" error="exit status 2"
lis 27 19:25:54 cachyos-x64 ollama[1144524]: time=2025-11-27T19:25:54.666+01:00 level=DEBUG source=server.go:1765 msg="llama server stopped" pid=1144796
lis 27 19:25:54 cachyos-x64 ollama[1144524]: [GIN] 2025/11/27 - 19:25:54 | 500 |  613.560082ms |       127.0.0.1 | POST     "/api/generate"

@YetheSamartaka commented on GitHub (Dec 8, 2025):

Issue is still present on version 0.13.1


@adex345 commented on GitHub (Dec 8, 2025):

> Issue is still present on version 0.13.1

I can confirm. A quick workaround is to use CPU only or Vulkan, but CPU only is quicker on my system (Vulkan 9 t/s, CPU only 15 t/s, 50% ROCm with 50% CPU 30 t/s).
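
As a sketch of the CPU-only workaround: GPU offload can be disabled per request via the num_gpu option, which sets the number of layers placed on the GPU (the endpoint, model tag, and prompt here are assumptions for this setup):

```shell
# Force CPU-only inference for one request by offloading zero layers.
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "qwen3:30b",
  "prompt": "Init",
  "options": { "num_gpu": 0 }
}'
```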


@rick-github commented on GitHub (Jan 14, 2026):

Unable to repro. Set OLLAMA_DEBUG=2 in the server environment and post the full log. Note that this will include the prompt, so be aware of PII.
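
For the setups in this thread, enabling debug logging might look like the following (container and service names are assumptions):

```shell
# Docker: recreate the container with OLLAMA_DEBUG=2 and capture the log.
# (Re-add the original -e flags from the issue as needed.)
docker rm -f ollama2
docker run -d --gpus=all -e OLLAMA_DEBUG=2 \
  -v ollama2:/root/.ollama -p 11439:11434 --name ollama2 ollama/ollama
docker logs -f ollama2 2>&1 | tee ollama-debug.log

# systemd: add the variable via an override, restart, and follow the journal.
sudo systemctl edit ollama    # add: Environment="OLLAMA_DEBUG=2"
sudo systemctl restart ollama
journalctl -u ollama -f
```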


@YetheSamartaka commented on GitHub (Jan 14, 2026):

@rick-github with the new 0.14.0 version of Ollama?


@rick-github commented on GitHub (Jan 14, 2026):

Does it fail with 0.14.0?


@YetheSamartaka commented on GitHub (Jan 15, 2026):

@rick-github I confirm that with 0.14.1 it is working much better, and I was able to load all those models into VRAM. Thank you very much for fixing it.
