[GH-ISSUE #11930] CUDA errors for long context gpt-oss:120b #54430

Closed
opened 2026-04-29 05:55:54 -05:00 by GiteaMirror · 1 comment

Originally created by @jsearcy1 on GitHub (Aug 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11930

What is the issue?

Hi All,

I'm trying to run some long-context text through gpt-oss:120b. I created a new model with PARAMETER num_ctx 128000, but when I try anything that long I get the CUDA crash in the log below (CUDA error: invalid configuration argument, followed by the "No symbol table is loaded." debugger output). I'm running on an H200 and still have about 40GB of memory free when the error occurs (according to nvidia-smi). I'm guessing there might be a 16-bit integer overflowing somewhere, because I can run successfully at 65K but it fails at 70K, and the 65,536 boundary sits between the two. Thanks for taking a look.
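For reproduction, the custom model was created along these lines (a minimal sketch; the model name `gpt-oss-128k` is a placeholder, everything else matches the description above):

```shell
# Modelfile: only the context length is changed from the stock model
cat > Modelfile <<'EOF'
FROM gpt-oss:120b
PARAMETER num_ctx 128000
EOF

ollama create gpt-oss-128k -f Modelfile
ollama run gpt-oss-128k   # a prompt past ~70K tokens triggers the crash below
```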

Relevant log output

time=2025-08-15T16:15:41.248-07:00 level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1,2,3 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-15T16:15:41.457-07:00 level=INFO source=images.go:477 msg="total blobs: 44"
time=2025-08-15T16:15:41.458-07:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-15T16:15:41.459-07:00 level=INFO source=routes.go:1357 msg="Listening on 127.0.0.1:11434 (version 0.11.4)"
time=2025-08-15T16:15:41.459-07:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-15T16:15:46.740-07:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-2691ebd7-3aea-1611-a8d3-12b456a33b86 library=cuda variant=v12 compute=9.0 driver=12.4 name="NVIDIA H200 NVL" total="139.7 GiB" available="139.2 GiB"
time=2025-08-15T16:15:46.740-07:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-2f9f9c52-a998-be7f-da5f-2e896bf6c402 library=cuda variant=v12 compute=9.0 driver=12.4 name="NVIDIA H200 NVL" total="139.7 GiB" available="139.2 GiB"
time=2025-08-15T16:15:46.740-07:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-1084798f-4308-377a-2b42-496b5f767a32 library=cuda variant=v12 compute=9.0 driver=12.4 name="NVIDIA H200 NVL" total="139.7 GiB" available="139.2 GiB"
time=2025-08-15T16:15:46.740-07:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-23ca7ec3-a9fa-1140-0e40-d54592d4225b library=cuda variant=v12 compute=9.0 driver=12.4 name="NVIDIA H200 NVL" total="139.7 GiB" available="139.2 GiB"
time=2025-08-15T16:17:49.128-07:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 gpu=GPU-2691ebd7-3aea-1611-a8d3-12b456a33b86 parallel=1 available=149467627520 required="113.2 GiB"
time=2025-08-15T16:17:50.116-07:00 level=INFO source=server.go:135 msg="system memory" total="1510.4 GiB" free="1497.9 GiB" free_swap="0 B"
time=2025-08-15T16:17:50.117-07:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split="" memory.available="[139.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="113.2 GiB" memory.required.partial="113.2 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[113.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-15T16:17:50.152-07:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/venv/bin/ollama runner --ollama-engine --model .ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 37 --threads 128 --parallel 1 --port 40311"
time=2025-08-15T16:17:50.156-07:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-15T16:17:50.156-07:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-15T16:17:50.159-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-15T16:17:50.162-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-15T16:17:50.162-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:40311"
time=2025-08-15T16:17:50.218-07:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
time=2025-08-15T16:17:50.412-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-15T16:17:50.903-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-15T16:17:51.654-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H200 NVL, compute capability 9.0, VMM: yes
load_backend: loaded CUDA backend from venv/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from venv/lib/ollama/libggml-cpu-icelake.so
time=2025-08-15T16:17:53.280-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-15T16:17:53.436-07:00 level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-15T16:17:53.436-07:00 level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-15T16:17:53.436-07:00 level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-15T16:17:53.436-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="59.7 GiB"
time=2025-08-15T16:17:53.436-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-15T16:17:53.445-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="31.5 GiB"
time=2025-08-15T16:17:53.445-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-15T16:18:45.654-07:00 level=INFO source=server.go:637 msg="llama runner started in 55.42 seconds"

[GIN] 2025/08/15 - 16:20:04 | 200 |   66.940872ms |       127.0.0.1 | POST     "/api/show"
time=2025-08-15T16:20:15.976-07:00 level=WARN source=runner.go:157 msg="truncating input prompt" limit=128000 prompt=299074 keep=4 new=128000
ggml_cuda_compute_forward: SCALE failed
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_compute_forward at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2377
  err
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:77: CUDA error
No symbol table is loaded.  Use the "file" command.
[New LWP 216260]
[New LWP 216261]
[New LWP 216262]
[New LWP 216265]
[New LWP 216268]
[New LWP 216269]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x000056470106084e in ?? ()
No symbol "frame" in current context.
[Inferior 1 (process 216259) detached]
SIGABRT: abort
PC=0x14bc86cb552f m=5 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 9 gp=0xc000161880 m=5 mp=0xc000258008 [syscall]:
runtime.cgocall(0x564701d93ec0, 0xc0004e3a58)
	runtime/cgocall.go:167 +0x4b fp=0xc0004e3a30 sp=0xc0004e39f8 pc=0x5647010c68cb
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x564704ce6930, 0x14bba4dfe790)
	_cgo_gotypes.go:886 +0x4a fp=0xc0004e3a58 sp=0xc0004e3a30 pc=0x564701500c6a
github.com/ollama/ollama/ml/backend/ggml.(*Context).Compute.func1(...)
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:627
github.com/ollama/ollama/ml/backend/ggml.(*Context).Compute(0xc000256000, {0xc00027c620, 0x1, 0x0?})
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:627 +0x9d fp=0xc0004e3b00 sp=0xc0004e3a58 pc=0x56470150c17d
github.com/ollama/ollama/model.Forward({0x564702447850, 0xc000256000}, {0x56470243e070, 0xc0000c96b0}, {0xc0011c6800, 0x200, 0x200}, {{0x5647024524a8, 0xc0019ea048}, {0x0, ...}, ...})
	github.com/ollama/ollama/model/model.go:305 +0x2a7 fp=0xc0004e3be8 sp=0xc0004e3b00 pc=0x56470151a147
github.com/ollama/ollama/runner/ollamarunner.(*Server).processBatch(0xc00041f9e0)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:480 +0x4c5 fp=0xc0004e3f98 sp=0xc0004e3be8 pc=0x5647015bb085
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc00041f9e0, {0x56470243f550, 0xc000437e00})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:362 +0x4e fp=0xc0004e3fb8 sp=0xc0004e3f98 pc=0x5647015bab6e
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap2()
	github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0x28 fp=0xc0004e3fe0 sp=0xc0004e3fb8 pc=0x5647015c02c8
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0004e3fe8 sp=0xc0004e3fe0 pc=0x5647010d1481
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0xa74

goroutine 1 gp=0xc000002380 m=nil [IO wait, 5 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc0004e5650 sp=0xc0004e5630 pc=0x5647010c9d4e
runtime.netpollblock(0xc0000576a0?, 0x1062b46?, 0x47?)
	runtime/netpoll.go:575 +0xf7 fp=0xc0004e5688 sp=0xc0004e5650 pc=0x56470108e837
internal/poll.runtime_pollWait(0x14bc88209eb0, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc0004e56a8 sp=0xc0004e5688 pc=0x5647010c8f65
internal/poll.(*pollDesc).wait(0xc00048d500?, 0x900000036?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0004e56d0 sp=0xc0004e56a8 pc=0x5647011503a7
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc00048d500)
	internal/poll/fd_unix.go:620 +0x295 fp=0xc0004e5778 sp=0xc0004e56d0 pc=0x564701155775
net.(*netFD).accept(0xc00048d500)
	net/fd_unix.go:172 +0x29 fp=0xc0004e5830 sp=0xc0004e5778 pc=0x5647011c7d89
net.(*TCPListener).accept(0xc000461d00)
	net/tcpsock_posix.go:159 +0x1b fp=0xc0004e5880 sp=0xc0004e5830 pc=0x5647011dd73b
net.(*TCPListener).Accept(0xc000461d00)
	net/tcpsock.go:380 +0x30 fp=0xc0004e58b0 sp=0xc0004e5880 pc=0x5647011dc5f0
net/http.(*onceCloseListener).Accept(0xc000230360?)
	<autogenerated>:1 +0x24 fp=0xc0004e58c8 sp=0xc0004e58b0 pc=0x5647013f3d44
net/http.(*Server).Serve(0xc0004b6d00, {0x56470243d0a8, 0xc000461d00})
	net/http/server.go:3424 +0x30c fp=0xc0004e59f8 sp=0xc0004e58c8 pc=0x5647013cb60c
github.com/ollama/ollama/runner/ollamarunner.Execute({0xc000034150, 0xe, 0xf})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:984 +0xe09 fp=0xc0004e5d08 sp=0xc0004e59f8 pc=0x5647015c0029
github.com/ollama/ollama/runner.Execute({0xc000034130?, 0x0?, 0x0?})
	github.com/ollama/ollama/runner/runner.go:20 +0xc9 fp=0xc0004e5d30 sp=0xc0004e5d08 pc=0x5647015c0929
github.com/ollama/ollama/cmd.NewCLI.func2(0xc0004b6b00?, {0x564701f8007e?, 0x4?, 0x564701f80082?})
	github.com/ollama/ollama/cmd/cmd.go:1583 +0x45 fp=0xc0004e5d58 sp=0xc0004e5d30 pc=0x564701d25e25
github.com/spf13/cobra.(*Command).execute(0xc0004bb508, {0xc0003090e0, 0xf, 0xf})
	github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc0004e5e78 sp=0xc0004e5d58 pc=0x5647012413dc
github.com/spf13/cobra.(*Command).ExecuteC(0xc000446f08)
	github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc0004e5f30 sp=0xc0004e5e78 pc=0x564701241c25
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
	github.com/ollama/ollama/main.go:12 +0x4d fp=0xc0004e5f50 sp=0xc0004e5f30 pc=0x564701d2690d
runtime.main()
	runtime/proc.go:283 +0x29d fp=0xc0004e5fe0 sp=0xc0004e5f50 pc=0x564701095ebd
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0004e5fe8 sp=0xc0004e5fe0 pc=0x5647010d1481

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle), 5 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x5647010c9d4e
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.forcegchelper()
	runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x5647010961f8
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x5647010d1481
created by runtime.init.7 in goroutine 1
	runtime/proc.go:336 +0x1a

goroutine 3 gp=0xc000002fc0 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073780 sp=0xc000073760 pc=0x5647010c9d4e
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.bgsweep(0xc000046080)
	runtime/mgcsweep.go:316 +0xdf fp=0xc0000737c8 sp=0xc000073780 pc=0x56470108099f
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x564701074d85
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x5647010d1481
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000003180 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x432835?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073f78 sp=0xc000073f58 pc=0x5647010c9d4e
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.(*scavengerState).park(0x564702cd49a0)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000073fa8 sp=0xc000073f78 pc=0x56470107e3e9
runtime.bgscavenge(0xc000046080)
	runtime/mgcscavenge.go:658 +0x59 fp=0xc000073fc8 sp=0xc000073fa8 pc=0x56470107e979
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x564701074d25
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x5647010d1481
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 5 gp=0xc000003a40 m=nil [finalizer wait, 5 minutes]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000072688?)
	runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x5647010c9d4e
runtime.runfinq()
	runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x564701073d47
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x5647010d1481
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:166 +0x3d

goroutine 6 gp=0xc000160540 m=nil [chan receive]:
runtime.gopark(0xc0000ed9a0?, 0xc0019ea018?, 0x60?, 0x5f?, 0x5647011ae9c8?)
	runtime/proc.go:435 +0xce fp=0xc000085f18 sp=0xc000085ef8 pc=0x5647010c9d4e
runtime.chanrecv(0xc00004a380, 0x0, 0x1)
	runtime/chan.go:664 +0x445 fp=0xc000085f90 sp=0xc000085f18 pc=0x564701065725
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:506 +0x12 fp=0xc000085fb8 sp=0xc000085f90 pc=0x5647010652b2
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
	runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1799 +0x2f fp=0xc000085fe0 sp=0xc000085fb8 pc=0x564701077f2f
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x5647010d1481
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1794 +0x85

goroutine 7 gp=0xc000160e00 m=nil [GC worker (idle)]:
runtime.gopark(0x51814e92dd7b?, 0x3?, 0xe9?, 0xaf?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000074f38 sp=0xc000074f18 pc=0x5647010c9d4e
runtime.gcBgMarkWorker(0xc00004b960)
	runtime/mgc.go:1423 +0xe9 fp=0xc000074fc8 sp=0xc000074f38 pc=0x564701077249
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc000074fe0 sp=0xc000074fc8 pc=0x564701077125
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x5647010d1481
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 709 gp=0xc0001616c0 m=nil [IO wait, 2 minutes]:
runtime.gopark(0x5?, 0x14bc88382108?, 0x60?, 0x0?, 0xb?)
	runtime/proc.go:435 +0xce fp=0xc000075dd8 sp=0xc000075db8 pc=0x5647010c9d4e
runtime.netpollblock(0x5647010ed0b8?, 0x1062b46?, 0x47?)
	runtime/netpoll.go:575 +0xf7 fp=0xc000075e10 sp=0xc000075dd8 pc=0x56470108e837
internal/poll.runtime_pollWait(0x14bc88209c80, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc000075e30 sp=0xc000075e10 pc=0x5647010c8f65
internal/poll.(*pollDesc).wait(0xc00048c700?, 0xc0004b53f1?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000075e58 sp=0xc000075e30 pc=0x5647011503a7
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00048c700, {0xc0004b53f1, 0x1, 0x1})
	internal/poll/fd_unix.go:165 +0x27a fp=0xc000075ef0 sp=0xc000075e58 pc=0x56470115169a
net.(*netFD).Read(0xc00048c700, {0xc0004b53f1?, 0xc000460218?, 0xc000075f70?})
	net/fd_posix.go:55 +0x25 fp=0xc000075f38 sp=0xc000075ef0 pc=0x5647011c5de5
net.(*conn).Read(0xc00005e2e8, {0xc0004b53f1?, 0x5647015bab6e?, 0xc00041f9e0?})
	net/net.go:194 +0x45 fp=0xc000075f80 sp=0xc000075f38 pc=0x5647011d41a5
net/http.(*connReader).backgroundRead(0xc0004b53e0)
	net/http/server.go:690 +0x37 fp=0xc000075fc8 sp=0xc000075f80 pc=0x5647013c0017
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:686 +0x25 fp=0xc000075fe0 sp=0xc000075fc8 pc=0x5647013bff45
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000075fe8 sp=0xc000075fe0 pc=0x5647010d1481
created by net/http.(*connReader).startBackgroundRead in goroutine 14
	net/http/server.go:686 +0xb6

goroutine 14 gp=0xc000160700 m=nil [select, 2 minutes]:
runtime.gopark(0xc0004e7a10?, 0x2?, 0x0?, 0xfa?, 0xc0004e7874?)
	runtime/proc.go:435 +0xce fp=0xc0004e76a0 sp=0xc0004e7680 pc=0x5647010c9d4e
runtime.selectgo(0xc0004e7a10, 0xc0004e7870, 0x1f400?, 0x0, 0x4?, 0x1)
	runtime/select.go:351 +0x837 fp=0xc0004e77d8 sp=0xc0004e76a0 pc=0x5647010a83b7
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc00041f9e0, {0x56470243d288, 0xc003fee1c0}, 0xc0004743c0)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:680 +0xb65 fp=0xc0004e7ac0 sp=0xc0004e77d8 pc=0x5647015bd3c5
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x56470243d288?, 0xc003fee1c0?}, 0xc0004e7b40?)
	<autogenerated>:1 +0x36 fp=0xc0004e7af0 sp=0xc0004e7ac0 pc=0x5647015c0796
net/http.HandlerFunc.ServeHTTP(0xc00040d680?, {0x56470243d288?, 0xc003fee1c0?}, 0xc0004e7b60?)
	net/http/server.go:2294 +0x29 fp=0xc0004e7b18 sp=0xc0004e7af0 pc=0x5647013c7c49
net/http.(*ServeMux).ServeHTTP(0x56470106e265?, {0x56470243d288, 0xc003fee1c0}, 0xc0004743c0)
	net/http/server.go:2822 +0x1c4 fp=0xc0004e7b68 sp=0xc0004e7b18 pc=0x5647013c9b44
net/http.serverHandler.ServeHTTP({0x5647024398d0?}, {0x56470243d288?, 0xc003fee1c0?}, 0x1?)
	net/http/server.go:3301 +0x8e fp=0xc0004e7b98 sp=0xc0004e7b68 pc=0x5647013e75ce
net/http.(*conn).serve(0xc000230360, {0x56470243f518, 0xc0004b4ea0})
	net/http/server.go:2102 +0x625 fp=0xc0004e7fb8 sp=0xc0004e7b98 pc=0x5647013c6145
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3454 +0x28 fp=0xc0004e7fe0 sp=0xc0004e7fb8 pc=0x5647013cba08
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0004e7fe8 sp=0xc0004e7fe0 pc=0x5647010d1481
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3454 +0x485

rax    0x0
rbx    0x6
rcx    0x14bc86cb552f
rdx    0x0
rdi    0x2
rsi    0x14bc3fb828e0
rbp    0x14bbe37140e5
rsp    0x14bc3fb828e0
r8     0x0
r9     0x14bc3fb828e0
r10    0x8
r11    0x246
r12    0x14bbe3714648
r13    0x4d
r14    0x14bc3f6904f8
r15    0x564704ce62d0
rip    0x14bc86cb552f
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2025-08-15T16:22:51.487-07:00 level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:40311/completion\": EOF"
[GIN] 2025/08/15 - 16:22:51 | 500 |         2m35s |       127.0.0.1 | POST     "/api/chat"
time=2025-08-15T16:22:51.669-07:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 2"
time=2025-08-15T16:22:57.871-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.953556091 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216259 runner.model=/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:22:59.399-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=7.481591271 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216259 runner.model=/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:23:00.939-07:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 gpu=GPU-2691ebd7-3aea-1611-a8d3-12b456a33b86 parallel=1 available=149467627520 required="113.2 GiB"
time=2025-08-15T16:23:01.947-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=10.029605582 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216259 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:23:03.457-07:00 level=INFO source=server.go:135 msg="system memory" total="1510.4 GiB" free="1497.8 GiB" free_swap="0 B"
time=2025-08-15T16:23:03.459-07:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split="" memory.available="[139.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="113.2 GiB" memory.required.partial="113.2 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[113.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-15T16:23:03.498-07:00 level=INFO source=server.go:438 msg="starting llama server" cmd="venv/bin/ollama runner --ollama-engine --model .ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 37 --threads 128 --parallel 1 --port 46171"
time=2025-08-15T16:23:03.502-07:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-15T16:23:03.502-07:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-15T16:23:03.505-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-15T16:23:03.506-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-15T16:23:03.506-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:46171"
time=2025-08-15T16:23:03.543-07:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
time=2025-08-15T16:23:03.756-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H200 NVL, compute capability 9.0, VMM: yes
load_backend: loaded CUDA backend from venv/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from venv/lib/ollama/libggml-cpu-icelake.so
time=2025-08-15T16:23:04.066-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="59.7 GiB"
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-15T16:23:04.448-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="31.5 GiB"
time=2025-08-15T16:23:04.448-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-15T16:23:55.354-07:00 level=INFO source=server.go:637 msg="llama runner started in 51.85 seconds"
time=2025-08-15T16:23:55.813-07:00 level=WARN source=runner.go:157 msg="truncating input prompt" limit=128000 prompt=299074 keep=4 new=128000
ggml_cuda_compute_forward: SCALE failed
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_compute_forward at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2377
  err
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:77: CUDA error
No symbol table is loaded.  Use the "file" command.
[New LWP 216895]
[New LWP 216896]
[New LWP 216897]
[New LWP 216898]
[New LWP 216899]
[New LWP 216900]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x0000149c03149312 in waitpid () from /lib64/libpthread.so.0
[Inferior 1 (process 216894) detached]
No symbol "frame" in current context.
SIGABRT: abort
PC=0x149c0207352f m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 9 gp=0xc000161340 m=0 mp=0x5631887777c0 [syscall]:
runtime.cgocall(0x563187833ec0, 0xc00019da58)
	runtime/cgocall.go:167 +0x4b fp=0xc00019da30 sp=0xc00019d9f8 pc=0x563186b668cb
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x563189ed8ac0, 0x56318a32fad0)
	_cgo_gotypes.go:886 +0x4a fp=0xc00019da58 sp=0xc00019da30 pc=0x563186fa0c6a
github.com/ollama/ollama/ml/backend/ggml.(*Context).Compute.func1(...)
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:627
github.com/ollama/ollama/ml/backend/ggml.(*Context).Compute(0xc00045c300, {0xc0011924b0, 0x1, 0x0?})
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:627 +0x9d fp=0xc00019db00 sp=0xc00019da58 pc=0x563186fac17d
github.com/ollama/ollama/model.Forward({0x563187ee7850, 0xc00045c300}, {0x563187ede070, 0xc0000c96b0}, {0xc0018e3000, 0x200, 0x200}, {{0x563187ef24a8, 0xc0011a4000}, {0x0, ...}, ...})
	github.com/ollama/ollama/model/model.go:305 +0x2a7 fp=0xc00019dbe8 sp=0xc00019db00 pc=0x563186fba147
github.com/ollama/ollama/runner/ollamarunner.(*Server).processBatch(0xc000409680)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:480 +0x4c5 fp=0xc00019df98 sp=0xc00019dbe8 pc=0x56318705b085
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc000409680, {0x563187edf550, 0xc00043de00})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:362 +0x4e fp=0xc00019dfb8 sp=0xc00019df98 pc=0x56318705ab6e
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap2()
	github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0x28 fp=0xc00019dfe0 sp=0xc00019dfb8 pc=0x5631870602c8
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00019dfe8 sp=0xc00019dfe0 pc=0x563186b71481
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0xa74

goroutine 1 gp=0xc000002380 m=nil [IO wait, 3 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc00019f650 sp=0xc00019f630 pc=0x563186b69d4e
runtime.netpollblock(0xc00019f6a0?, 0x86b02b46?, 0x31?)
	runtime/netpoll.go:575 +0xf7 fp=0xc00019f688 sp=0xc00019f650 pc=0x563186b2e837
internal/poll.runtime_pollWait(0x149c035c7eb0, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc00019f6a8 sp=0xc00019f688 pc=0x563186b68f65
internal/poll.(*pollDesc).wait(0xc0003f2080?, 0x900b0ce3e?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00019f6d0 sp=0xc00019f6a8 pc=0x563186bf03a7
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0003f2080)
	internal/poll/fd_unix.go:620 +0x295 fp=0xc00019f778 sp=0xc00019f6d0 pc=0x563186bf5775
net.(*netFD).accept(0xc0003f2080)
	net/fd_unix.go:172 +0x29 fp=0xc00019f830 sp=0xc00019f778 pc=0x563186c67d89
net.(*TCPListener).accept(0xc00045dd00)
	net/tcpsock_posix.go:159 +0x1b fp=0xc00019f880 sp=0xc00019f830 pc=0x563186c7d73b
net.(*TCPListener).Accept(0xc00045dd00)
	net/tcpsock.go:380 +0x30 fp=0xc00019f8b0 sp=0xc00019f880 pc=0x563186c7c5f0
net/http.(*onceCloseListener).Accept(0xc00047a3f0?)
	<autogenerated>:1 +0x24 fp=0xc00019f8c8 sp=0xc00019f8b0 pc=0x563186e93d44
net/http.(*Server).Serve(0xc000163400, {0x563187edd0a8, 0xc00045dd00})
	net/http/server.go:3424 +0x30c fp=0xc00019f9f8 sp=0xc00019f8c8 pc=0x563186e6b60c
github.com/ollama/ollama/runner/ollamarunner.Execute({0xc000034150, 0xe, 0xf})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:984 +0xe09 fp=0xc00019fd08 sp=0xc00019f9f8 pc=0x563187060029
github.com/ollama/ollama/runner.Execute({0xc000034130?, 0x0?, 0x0?})
	github.com/ollama/ollama/runner/runner.go:20 +0xc9 fp=0xc00019fd30 sp=0xc00019fd08 pc=0x563187060929
github.com/ollama/ollama/cmd.NewCLI.func2(0xc000163200?, {0x563187a2007e?, 0x4?, 0x563187a20082?})
	github.com/ollama/ollama/cmd/cmd.go:1583 +0x45 fp=0xc00019fd58 sp=0xc00019fd30 pc=0x5631877c5e25
github.com/spf13/cobra.(*Command).execute(0xc00047cf08, {0xc0003065a0, 0xf, 0xf})
	github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc00019fe78 sp=0xc00019fd58 pc=0x563186ce13dc
github.com/spf13/cobra.(*Command).ExecuteC(0xc000434f08)
	github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc00019ff30 sp=0xc00019fe78 pc=0x563186ce1c25
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
	github.com/ollama/ollama/main.go:12 +0x4d fp=0xc00019ff50 sp=0xc00019ff30 pc=0x5631877c690d
runtime.main()
	runtime/proc.go:283 +0x29d fp=0xc00019ffe0 sp=0xc00019ff50 pc=0x563186b35ebd
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00019ffe8 sp=0xc00019ffe0 pc=0x563186b71481

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle), 3 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x563186b69d4e
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.forcegchelper()
	runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x563186b361f8
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x563186b71481
created by runtime.init.7 in goroutine 1
	runtime/proc.go:336 +0x1a

goroutine 3 gp=0xc000002fc0 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073780 sp=0xc000073760 pc=0x563186b69d4e
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.bgsweep(0xc000046080)
	runtime/mgcsweep.go:316 +0xdf fp=0xc0000737c8 sp=0xc000073780 pc=0x563186b2099f
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x563186b14d85
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x563186b71481
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000003180 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0xfe742?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073f78 sp=0xc000073f58 pc=0x563186b69d4e
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.(*scavengerState).park(0x5631887749a0)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000073fa8 sp=0xc000073f78 pc=0x563186b1e3e9
runtime.bgscavenge(0xc000046080)
	runtime/mgcscavenge.go:658 +0x59 fp=0xc000073fc8 sp=0xc000073fa8 pc=0x563186b1e979
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x563186b14d25
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x563186b71481
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 5 gp=0xc000003a40 m=nil [finalizer wait, 3 minutes]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000072688?)
	runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x563186b69d4e
runtime.runfinq()
	runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x563186b13d47
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x563186b71481
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:166 +0x3d

goroutine 6 gp=0xc000160540 m=nil [chan receive]:
runtime.gopark(0xc0001417c0?, 0xc0011a5890?, 0x60?, 0x47?, 0x563186c4e9c8?)
	runtime/proc.go:435 +0xce fp=0xc000074718 sp=0xc0000746f8 pc=0x563186b69d4e
runtime.chanrecv(0xc00004a380, 0x0, 0x1)
	runtime/chan.go:664 +0x445 fp=0xc000074790 sp=0xc000074718 pc=0x563186b05725
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:506 +0x12 fp=0xc0000747b8 sp=0xc000074790 pc=0x563186b052b2
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
	runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1799 +0x2f fp=0xc0000747e0 sp=0xc0000747b8 pc=0x563186b17f2f
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000747e8 sp=0xc0000747e0 pc=0x563186b71481
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1794 +0x85

goroutine 7 gp=0xc000160c40 m=nil [GC worker (idle)]:
runtime.gopark(0x51b366a14d49?, 0x3?, 0xc6?, 0x56?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000074f38 sp=0xc000074f18 pc=0x563186b69d4e
runtime.gcBgMarkWorker(0xc00004b7a0)
	runtime/mgc.go:1423 +0xe9 fp=0xc000074fc8 sp=0xc000074f38 pc=0x563186b17249
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc000074fe0 sp=0xc000074fc8 pc=0x563186b17125
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x563186b71481
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 686 gp=0xc000161180 m=nil [IO wait, 2 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
	runtime/proc.go:435 +0xce fp=0xc00006fdd8 sp=0xc00006fdb8 pc=0x563186b69d4e
runtime.netpollblock(0x563186b8d0b8?, 0x86b02b46?, 0x31?)
	runtime/netpoll.go:575 +0xf7 fp=0xc00006fe10 sp=0xc00006fdd8 pc=0x563186b2e837
internal/poll.runtime_pollWait(0x149c035c7d98, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc00006fe30 sp=0xc00006fe10 pc=0x563186b68f65
internal/poll.(*pollDesc).wait(0xc0003f2100?, 0xc0003fcf41?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00006fe58 sp=0xc00006fe30 pc=0x563186bf03a7
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0003f2100, {0xc0003fcf41, 0x1, 0x1})
	internal/poll/fd_unix.go:165 +0x27a fp=0xc00006fef0 sp=0xc00006fe58 pc=0x563186bf169a
net.(*netFD).Read(0xc0003f2100, {0xc0003fcf41?, 0xc00045c0d8?, 0xc00006ff70?})
	net/fd_posix.go:55 +0x25 fp=0xc00006ff38 sp=0xc00006fef0 pc=0x563186c65de5
net.(*conn).Read(0xc00005e098, {0xc0003fcf41?, 0x56318705ab6e?, 0xc000409680?})
	net/net.go:194 +0x45 fp=0xc00006ff80 sp=0xc00006ff38 pc=0x563186c741a5
net/http.(*connReader).backgroundRead(0xc0003fcf30)
	net/http/server.go:690 +0x37 fp=0xc00006ffc8 sp=0xc00006ff80 pc=0x563186e60017
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:686 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x563186e5ff45
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x563186b71481
created by net/http.(*connReader).startBackgroundRead in goroutine 10
	net/http/server.go:686 +0xb6

goroutine 10 gp=0xc000161500 m=nil [select, 2 minutes]:
runtime.gopark(0xc000057a10?, 0x2?, 0x0?, 0x79?, 0xc000057874?)
	runtime/proc.go:435 +0xce fp=0xc0001a16a0 sp=0xc0001a1680 pc=0x563186b69d4e
runtime.selectgo(0xc0001a1a10, 0xc000057870, 0x1f400?, 0x0, 0x4?, 0x1)
	runtime/select.go:351 +0x837 fp=0xc0001a17d8 sp=0xc0001a16a0 pc=0x563186b483b7
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc000409680, {0x563187edd288, 0xc000272a80}, 0xc0001b1540)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:680 +0xb65 fp=0xc0001a1ac0 sp=0xc0001a17d8 pc=0x56318705d3c5
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x563187edd288?, 0xc000272a80?}, 0xc0001a1b40?)
	<autogenerated>:1 +0x36 fp=0xc0001a1af0 sp=0xc0001a1ac0 pc=0x563187060796
net/http.HandlerFunc.ServeHTTP(0xc00041f200?, {0x563187edd288?, 0xc000272a80?}, 0xc0001a1b60?)
	net/http/server.go:2294 +0x29 fp=0xc0001a1b18 sp=0xc0001a1af0 pc=0x563186e67c49
net/http.(*ServeMux).ServeHTTP(0x563186b0e265?, {0x563187edd288, 0xc000272a80}, 0xc0001b1540)
	net/http/server.go:2822 +0x1c4 fp=0xc0001a1b68 sp=0xc0001a1b18 pc=0x563186e69b44
net/http.serverHandler.ServeHTTP({0x563187ed98d0?}, {0x563187edd288?, 0xc000272a80?}, 0x1?)
	net/http/server.go:3301 +0x8e fp=0xc0001a1b98 sp=0xc0001a1b68 pc=0x563186e875ce
net/http.(*conn).serve(0xc00047a3f0, {0x563187edf518, 0xc0003fd320})
	net/http/server.go:2102 +0x625 fp=0xc0001a1fb8 sp=0xc0001a1b98 pc=0x563186e66145
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3454 +0x28 fp=0xc0001a1fe0 sp=0xc0001a1fb8 pc=0x563186e6ba08
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0001a1fe8 sp=0xc0001a1fe0 pc=0x563186b71481
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3454 +0x485

rax    0x0
rbx    0x6
rcx    0x149c0207352f
rdx    0x0
rdi    0x2
rsi    0x7ffdff767430
rbp    0x149b5f7140e5
rsp    0x7ffdff767430
r8     0x0
r9     0x7ffdff767430
r10    0x8
r11    0x246
r12    0x149b5f714648
r13    0x4d
r14    0x149bbac3f4f8
r15    0x563189ed8420
rip    0x149c0207352f
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2025-08-15T16:26:26.417-07:00 level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:46171/completion\": EOF"
[GIN] 2025/08/15 - 16:26:26 | 500 |         3m34s |       127.0.0.1 | POST     "/api/chat"
time=2025-08-15T16:26:26.556-07:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 2"
time=2025-08-15T16:26:32.797-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.116413847 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216894 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:33.760-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=6.080232927 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216894 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:34.759-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=7.078477012 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216894 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:35.803-07:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 gpu=GPU-2691ebd7-3aea-1611-a8d3-12b456a33b86 parallel=1 available=149467627520 required="113.2 GiB"
time=2025-08-15T16:26:36.979-07:00 level=INFO source=server.go:135 msg="system memory" total="1510.4 GiB" free="1497.8 GiB" free_swap="0 B"
time=2025-08-15T16:26:36.979-07:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split="" memory.available="[139.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="113.2 GiB" memory.required.partial="113.2 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[113.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-15T16:26:37.016-07:00 level=INFO source=server.go:438 msg="starting llama server" cmd="venv/bin/ollama runner --ollama-engine --model .ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 37 --threads 128 --parallel 1 --port 41021"
time=2025-08-15T16:26:37.020-07:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-15T16:26:37.020-07:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-15T16:26:37.020-07:00 level=WARN source=server.go:605 msg="client connection closed before server finished loading, aborting load"
time=2025-08-15T16:26:37.070-07:00 level=ERROR source=sched.go:487 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
time=2025-08-15T16:26:37.084-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-15T16:26:37.084-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:41021"
[GIN] 2025/08/15 - 16:26:37 | 499 |  9.468164032s |       127.0.0.1 | POST     "/api/chat"
time=2025-08-15T16:26:37.163-07:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H200 NVL, compute capability 9.0, VMM: yes
load_backend: loaded CUDA backend from venv/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from venv/lib/ollama/libggml-cpu-icelake.so
time=2025-08-15T16:26:37.987-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="59.7 GiB"
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-15T16:26:38.340-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="31.5 GiB"
time=2025-08-15T16:26:38.340-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-15T16:26:42.884-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.814082541 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=217328 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:43.923-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=6.853227123 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=217328 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:44.925-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=7.855429753 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=217328 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
[GIN] 2025/08/15 - 16:27:26 | 200 |      64.428µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/15 - 16:27:27 | 200 |   75.000746ms |       127.0.0.1 | POST     "/api/create"
time=2025-08-15T16:29:22.326-07:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 gpu=GPU-2691ebd7-3aea-1611-a8d3-12b456a33b86 parallel=1 available=149467627520 required="102.0 GiB"
time=2025-08-15T16:29:23.295-07:00 level=INFO source=server.go:135 msg="system memory" total="1510.4 GiB" free="1496.3 GiB" free_swap="0 B"
time=2025-08-15T16:29:23.295-07:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split="" memory.available="[139.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="102.0 GiB" memory.required.partial="102.0 GiB" memory.required.kv="3.6 GiB" memory.required.allocations="[102.0 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="36.6 GiB" memory.graph.partial="36.6 GiB"
time=2025-08-15T16:29:23.333-07:00 level=INFO source=server.go:438 msg="starting llama server" cmd="venv/bin/ollama runner --ollama-engine --model .ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 100000 --batch-size 512 --n-gpu-layers 37 --threads 128 --parallel 1 --port 34591"
time=2025-08-15T16:29:23.337-07:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-15T16:29:23.337-07:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-15T16:29:23.338-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-15T16:29:23.373-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-15T16:29:23.374-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:34591"
time=2025-08-15T16:29:23.411-07:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
time=2025-08-15T16:29:23.589-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-15T16:29:24.040-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-15T16:29:24.291-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H200 NVL, compute capability 9.0, VMM: yes
load_backend: loaded CUDA backend from venv/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from venv/lib/ollama/libggml-cpu-icelake.so
time=2025-08-15T16:29:25.282-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="59.7 GiB"
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-15T16:29:25.451-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="24.6 GiB"
time=2025-08-15T16:29:25.451-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-15T16:29:33.827-07:00 level=WARN source=server.go:605 msg="client connection closed before server finished loading, aborting load"
time=2025-08-15T16:29:33.827-07:00 level=ERROR source=sched.go:487 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
[GIN] 2025/08/15 - 16:29:33 | 499 | 12.820856779s |       127.0.0.1 | POST     "/api/chat"
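
For what it's worth on the error itself: "invalid configuration argument" is the CUDA runtime's response to an out-of-range kernel launch configuration (for example, a block count that has gone negative or oversized), not to running out of memory, which fits the ~40GB of free VRAM reported above. One hedged guess consistent with the 65K-works/70K-fails boundary: if the SCALE op's element count is computed in 32-bit arithmetic, a shape like 512 (batch) × 64 (heads) × 65,536 (KV positions) is exactly 2^31 elements and wraps negative; the head count here is illustrative, not necessarily the model's actual configuration. A self-contained sketch (hypothetical, not ggml's actual kernel) that reproduces the same error string:

```cuda
// Hypothetical sketch, NOT ggml's actual SCALE kernel: a 32-bit element count
// that wraps past 2^31 yields a negative block count, and the launch fails
// with the same "invalid configuration argument" seen in the log above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_f32(float *x, float v, int k) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < k) x[i] *= v;
}

int main() {
    // Illustrative shape: 512 (batch) * 64 (heads) * 65536 (KV positions) == 2^31.
    long long ne = 512LL * 64 * 65536;
    int k = (int) ne;                  // wraps to -2147483648 in 32-bit
    int num_blocks = (k + 255) / 256;  // negative block count
    printf("ne=%lld k=%d num_blocks=%d\n", ne, k, num_blocks);

    // dim3 converts the negative count to a huge unsigned value; the runtime
    // rejects the launch before any kernel code runs.
    scale_f32<<<num_blocks, 256>>>(nullptr, 1.0f, k);
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    // expected output: launch: invalid configuration argument
    return 0;
}
```

If that is the failure mode, the fix would be promoting the element count to 64-bit (or splitting the launch) rather than anything memory-related.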

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.11.4

runtime/netpoll.go:575 +0xf7 fp=0xc000075e10 sp=0xc000075dd8 pc=0x56470108e837 internal/poll.runtime_pollWait(0x14bc88209c80, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc000075e30 sp=0xc000075e10 pc=0x5647010c8f65 internal/poll.(*pollDesc).wait(0xc00048c700?, 0xc0004b53f1?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000075e58 sp=0xc000075e30 pc=0x5647011503a7 internal/poll.(*pollDesc).waitRead(...) internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc00048c700, {0xc0004b53f1, 0x1, 0x1}) internal/poll/fd_unix.go:165 +0x27a fp=0xc000075ef0 sp=0xc000075e58 pc=0x56470115169a net.(*netFD).Read(0xc00048c700, {0xc0004b53f1?, 0xc000460218?, 0xc000075f70?}) net/fd_posix.go:55 +0x25 fp=0xc000075f38 sp=0xc000075ef0 pc=0x5647011c5de5 net.(*conn).Read(0xc00005e2e8, {0xc0004b53f1?, 0x5647015bab6e?, 0xc00041f9e0?}) net/net.go:194 +0x45 fp=0xc000075f80 sp=0xc000075f38 pc=0x5647011d41a5 net/http.(*connReader).backgroundRead(0xc0004b53e0) net/http/server.go:690 +0x37 fp=0xc000075fc8 sp=0xc000075f80 pc=0x5647013c0017 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:686 +0x25 fp=0xc000075fe0 sp=0xc000075fc8 pc=0x5647013bff45 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000075fe8 sp=0xc000075fe0 pc=0x5647010d1481 created by net/http.(*connReader).startBackgroundRead in goroutine 14 net/http/server.go:686 +0xb6 goroutine 14 gp=0xc000160700 m=nil [select, 2 minutes]: runtime.gopark(0xc0004e7a10?, 0x2?, 0x0?, 0xfa?, 0xc0004e7874?) runtime/proc.go:435 +0xce fp=0xc0004e76a0 sp=0xc0004e7680 pc=0x5647010c9d4e runtime.selectgo(0xc0004e7a10, 0xc0004e7870, 0x1f400?, 0x0, 0x4?, 0x1) runtime/select.go:351 +0x837 fp=0xc0004e77d8 sp=0xc0004e76a0 pc=0x5647010a83b7 github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc00041f9e0, {0x56470243d288, 0xc003fee1c0}, 0xc0004743c0) github.com/ollama/ollama/runner/ollamarunner/runner.go:680 +0xb65 fp=0xc0004e7ac0 sp=0xc0004e77d8 pc=0x5647015bd3c5 github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x56470243d288?, 0xc003fee1c0?}, 0xc0004e7b40?) <autogenerated>:1 +0x36 fp=0xc0004e7af0 sp=0xc0004e7ac0 pc=0x5647015c0796 net/http.HandlerFunc.ServeHTTP(0xc00040d680?, {0x56470243d288?, 0xc003fee1c0?}, 0xc0004e7b60?) net/http/server.go:2294 +0x29 fp=0xc0004e7b18 sp=0xc0004e7af0 pc=0x5647013c7c49 net/http.(*ServeMux).ServeHTTP(0x56470106e265?, {0x56470243d288, 0xc003fee1c0}, 0xc0004743c0) net/http/server.go:2822 +0x1c4 fp=0xc0004e7b68 sp=0xc0004e7b18 pc=0x5647013c9b44 net/http.serverHandler.ServeHTTP({0x5647024398d0?}, {0x56470243d288?, 0xc003fee1c0?}, 0x1?) 
net/http/server.go:3301 +0x8e fp=0xc0004e7b98 sp=0xc0004e7b68 pc=0x5647013e75ce net/http.(*conn).serve(0xc000230360, {0x56470243f518, 0xc0004b4ea0}) net/http/server.go:2102 +0x625 fp=0xc0004e7fb8 sp=0xc0004e7b98 pc=0x5647013c6145 net/http.(*Server).Serve.gowrap3() net/http/server.go:3454 +0x28 fp=0xc0004e7fe0 sp=0xc0004e7fb8 pc=0x5647013cba08 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0004e7fe8 sp=0xc0004e7fe0 pc=0x5647010d1481 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3454 +0x485 rax 0x0 rbx 0x6 rcx 0x14bc86cb552f rdx 0x0 rdi 0x2 rsi 0x14bc3fb828e0 rbp 0x14bbe37140e5 rsp 0x14bc3fb828e0 r8 0x0 r9 0x14bc3fb828e0 r10 0x8 r11 0x246 r12 0x14bbe3714648 r13 0x4d r14 0x14bc3f6904f8 r15 0x564704ce62d0 rip 0x14bc86cb552f rflags 0x246 cs 0x33 fs 0x0 gs 0x0 time=2025-08-15T16:22:51.487-07:00 level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:40311/completion\": EOF" [GIN] 2025/08/15 - 16:22:51 | 500 | 2m35s | 127.0.0.1 | POST "/api/chat" time=2025-08-15T16:22:51.669-07:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 2" time=2025-08-15T16:22:57.871-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.953556091 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216259 runner.model=/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 time=2025-08-15T16:22:59.399-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=7.481591271 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216259 runner.model=/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 time=2025-08-15T16:23:00.939-07:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 gpu=GPU-2691ebd7-3aea-1611-a8d3-12b456a33b86 parallel=1 available=149467627520 required="113.2 GiB" time=2025-08-15T16:23:01.947-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=10.029605582 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216259 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 time=2025-08-15T16:23:03.457-07:00 level=INFO source=server.go:135 msg="system memory" total="1510.4 GiB" free="1497.8 GiB" free_swap="0 B" time=2025-08-15T16:23:03.459-07:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split="" memory.available="[139.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="113.2 GiB" memory.required.partial="113.2 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[113.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB" time=2025-08-15T16:23:03.498-07:00 level=INFO source=server.go:438 msg="starting llama server" cmd="venv/bin/ollama runner --ollama-engine --model .ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 37 --threads 128 --parallel 1 --port 46171" time=2025-08-15T16:23:03.502-07:00 level=INFO source=sched.go:481 msg="loaded runners" count=1 
time=2025-08-15T16:23:03.502-07:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-15T16:23:03.505-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-15T16:23:03.506-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-15T16:23:03.506-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:46171"
time=2025-08-15T16:23:03.543-07:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
time=2025-08-15T16:23:03.756-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H200 NVL, compute capability 9.0, VMM: yes
load_backend: loaded CUDA backend from venv/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from venv/lib/ollama/libggml-cpu-icelake.so
time=2025-08-15T16:23:04.066-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="59.7 GiB"
time=2025-08-15T16:23:04.418-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-15T16:23:04.448-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="31.5 GiB"
time=2025-08-15T16:23:04.448-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-15T16:23:55.354-07:00 level=INFO source=server.go:637 msg="llama runner started in 51.85 seconds"
time=2025-08-15T16:23:55.813-07:00 level=WARN source=runner.go:157 msg="truncating input prompt" limit=128000 prompt=299074 keep=4 new=128000
ggml_cuda_compute_forward: SCALE failed
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_compute_forward at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2377
  err
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:77: CUDA error
No symbol table is loaded. Use the "file" command.
[New LWP 216895]
[New LWP 216896]
[New LWP 216897]
[New LWP 216898]
[New LWP 216899]
[New LWP 216900]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x0000149c03149312 in waitpid () from /lib64/libpthread.so.0
[Inferior 1 (process 216894) detached]
No symbol "frame" in current context.

SIGABRT: abort
PC=0x149c0207352f m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 9 gp=0xc000161340 m=0 mp=0x5631887777c0 [syscall]:
runtime.cgocall(0x563187833ec0, 0xc00019da58)
	runtime/cgocall.go:167 +0x4b fp=0xc00019da30 sp=0xc00019d9f8 pc=0x563186b668cb
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x563189ed8ac0, 0x56318a32fad0)
	_cgo_gotypes.go:886 +0x4a fp=0xc00019da58 sp=0xc00019da30 pc=0x563186fa0c6a
github.com/ollama/ollama/ml/backend/ggml.(*Context).Compute.func1(...)
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:627
github.com/ollama/ollama/ml/backend/ggml.(*Context).Compute(0xc00045c300, {0xc0011924b0, 0x1, 0x0?})
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:627 +0x9d fp=0xc00019db00 sp=0xc00019da58 pc=0x563186fac17d
github.com/ollama/ollama/model.Forward({0x563187ee7850, 0xc00045c300}, {0x563187ede070, 0xc0000c96b0}, {0xc0018e3000, 0x200, 0x200}, {{0x563187ef24a8, 0xc0011a4000}, {0x0, ...}, ...})
	github.com/ollama/ollama/model/model.go:305 +0x2a7 fp=0xc00019dbe8 sp=0xc00019db00 pc=0x563186fba147
github.com/ollama/ollama/runner/ollamarunner.(*Server).processBatch(0xc000409680)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:480 +0x4c5 fp=0xc00019df98 sp=0xc00019dbe8 pc=0x56318705b085
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc000409680, {0x563187edf550, 0xc00043de00})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:362 +0x4e fp=0xc00019dfb8 sp=0xc00019df98 pc=0x56318705ab6e
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap2()
	github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0x28 fp=0xc00019dfe0 sp=0xc00019dfb8 pc=0x5631870602c8
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00019dfe8 sp=0xc00019dfe0 pc=0x563186b71481
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0xa74

goroutine 1 gp=0xc000002380 m=nil [IO wait, 3 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc00019f650 sp=0xc00019f630 pc=0x563186b69d4e
runtime.netpollblock(0xc00019f6a0?, 0x86b02b46?, 0x31?)
	runtime/netpoll.go:575 +0xf7 fp=0xc00019f688 sp=0xc00019f650 pc=0x563186b2e837
internal/poll.runtime_pollWait(0x149c035c7eb0, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc00019f6a8 sp=0xc00019f688 pc=0x563186b68f65
internal/poll.(*pollDesc).wait(0xc0003f2080?, 0x900b0ce3e?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00019f6d0 sp=0xc00019f6a8 pc=0x563186bf03a7
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0003f2080)
	internal/poll/fd_unix.go:620 +0x295 fp=0xc00019f778 sp=0xc00019f6d0 pc=0x563186bf5775
net.(*netFD).accept(0xc0003f2080)
	net/fd_unix.go:172 +0x29 fp=0xc00019f830 sp=0xc00019f778 pc=0x563186c67d89
net.(*TCPListener).accept(0xc00045dd00)
	net/tcpsock_posix.go:159 +0x1b fp=0xc00019f880 sp=0xc00019f830 pc=0x563186c7d73b
net.(*TCPListener).Accept(0xc00045dd00)
	net/tcpsock.go:380 +0x30 fp=0xc00019f8b0 sp=0xc00019f880 pc=0x563186c7c5f0
net/http.(*onceCloseListener).Accept(0xc00047a3f0?)
	<autogenerated>:1 +0x24 fp=0xc00019f8c8 sp=0xc00019f8b0 pc=0x563186e93d44
net/http.(*Server).Serve(0xc000163400, {0x563187edd0a8, 0xc00045dd00})
	net/http/server.go:3424 +0x30c fp=0xc00019f9f8 sp=0xc00019f8c8 pc=0x563186e6b60c
github.com/ollama/ollama/runner/ollamarunner.Execute({0xc000034150, 0xe, 0xf})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:984 +0xe09 fp=0xc00019fd08 sp=0xc00019f9f8 pc=0x563187060029
github.com/ollama/ollama/runner.Execute({0xc000034130?, 0x0?, 0x0?})
	github.com/ollama/ollama/runner/runner.go:20 +0xc9 fp=0xc00019fd30 sp=0xc00019fd08 pc=0x563187060929
github.com/ollama/ollama/cmd.NewCLI.func2(0xc000163200?, {0x563187a2007e?, 0x4?, 0x563187a20082?})
	github.com/ollama/ollama/cmd/cmd.go:1583 +0x45 fp=0xc00019fd58 sp=0xc00019fd30 pc=0x5631877c5e25
github.com/spf13/cobra.(*Command).execute(0xc00047cf08, {0xc0003065a0, 0xf, 0xf})
	github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc00019fe78 sp=0xc00019fd58 pc=0x563186ce13dc
github.com/spf13/cobra.(*Command).ExecuteC(0xc000434f08)
	github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc00019ff30 sp=0xc00019fe78 pc=0x563186ce1c25
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
	github.com/ollama/ollama/main.go:12 +0x4d fp=0xc00019ff50 sp=0xc00019ff30 pc=0x5631877c690d
runtime.main()
	runtime/proc.go:283 +0x29d fp=0xc00019ffe0 sp=0xc00019ff50 pc=0x563186b35ebd
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00019ffe8 sp=0xc00019ffe0 pc=0x563186b71481

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle), 3 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x563186b69d4e
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.forcegchelper()
	runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x563186b361f8
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x563186b71481
created by runtime.init.7 in goroutine 1
	runtime/proc.go:336 +0x1a

goroutine 3 gp=0xc000002fc0 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073780 sp=0xc000073760 pc=0x563186b69d4e
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.bgsweep(0xc000046080)
	runtime/mgcsweep.go:316 +0xdf fp=0xc0000737c8 sp=0xc000073780 pc=0x563186b2099f
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x563186b14d85
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x563186b71481
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000003180 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0xfe742?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073f78 sp=0xc000073f58 pc=0x563186b69d4e
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.(*scavengerState).park(0x5631887749a0)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000073fa8 sp=0xc000073f78 pc=0x563186b1e3e9
runtime.bgscavenge(0xc000046080)
	runtime/mgcscavenge.go:658 +0x59 fp=0xc000073fc8 sp=0xc000073fa8 pc=0x563186b1e979
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x563186b14d25
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x563186b71481
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 5 gp=0xc000003a40 m=nil [finalizer wait, 3 minutes]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000072688?)
	runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x563186b69d4e
runtime.runfinq()
	runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x563186b13d47
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x563186b71481
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:166 +0x3d

goroutine 6 gp=0xc000160540 m=nil [chan receive]:
runtime.gopark(0xc0001417c0?, 0xc0011a5890?, 0x60?, 0x47?, 0x563186c4e9c8?)
	runtime/proc.go:435 +0xce fp=0xc000074718 sp=0xc0000746f8 pc=0x563186b69d4e
runtime.chanrecv(0xc00004a380, 0x0, 0x1)
	runtime/chan.go:664 +0x445 fp=0xc000074790 sp=0xc000074718 pc=0x563186b05725
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:506 +0x12 fp=0xc0000747b8 sp=0xc000074790 pc=0x563186b052b2
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
	runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1799 +0x2f fp=0xc0000747e0 sp=0xc0000747b8 pc=0x563186b17f2f
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000747e8 sp=0xc0000747e0 pc=0x563186b71481
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1794 +0x85

goroutine 7 gp=0xc000160c40 m=nil [GC worker (idle)]:
runtime.gopark(0x51b366a14d49?, 0x3?, 0xc6?, 0x56?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000074f38 sp=0xc000074f18 pc=0x563186b69d4e
runtime.gcBgMarkWorker(0xc00004b7a0)
	runtime/mgc.go:1423 +0xe9 fp=0xc000074fc8 sp=0xc000074f38 pc=0x563186b17249
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc000074fe0 sp=0xc000074fc8 pc=0x563186b17125
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x563186b71481
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 686 gp=0xc000161180 m=nil [IO wait, 2 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
	runtime/proc.go:435 +0xce fp=0xc00006fdd8 sp=0xc00006fdb8 pc=0x563186b69d4e
runtime.netpollblock(0x563186b8d0b8?, 0x86b02b46?, 0x31?)
	runtime/netpoll.go:575 +0xf7 fp=0xc00006fe10 sp=0xc00006fdd8 pc=0x563186b2e837
internal/poll.runtime_pollWait(0x149c035c7d98, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc00006fe30 sp=0xc00006fe10 pc=0x563186b68f65
internal/poll.(*pollDesc).wait(0xc0003f2100?, 0xc0003fcf41?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00006fe58 sp=0xc00006fe30 pc=0x563186bf03a7
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0003f2100, {0xc0003fcf41, 0x1, 0x1})
	internal/poll/fd_unix.go:165 +0x27a fp=0xc00006fef0 sp=0xc00006fe58 pc=0x563186bf169a
net.(*netFD).Read(0xc0003f2100, {0xc0003fcf41?, 0xc00045c0d8?, 0xc00006ff70?})
	net/fd_posix.go:55 +0x25 fp=0xc00006ff38 sp=0xc00006fef0 pc=0x563186c65de5
net.(*conn).Read(0xc00005e098, {0xc0003fcf41?, 0x56318705ab6e?, 0xc000409680?})
	net/net.go:194 +0x45 fp=0xc00006ff80 sp=0xc00006ff38 pc=0x563186c741a5
net/http.(*connReader).backgroundRead(0xc0003fcf30)
	net/http/server.go:690 +0x37 fp=0xc00006ffc8 sp=0xc00006ff80 pc=0x563186e60017
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:686 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x563186e5ff45
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x563186b71481
created by net/http.(*connReader).startBackgroundRead in goroutine 10
	net/http/server.go:686 +0xb6

goroutine 10 gp=0xc000161500 m=nil [select, 2 minutes]:
runtime.gopark(0xc000057a10?, 0x2?, 0x0?, 0x79?, 0xc000057874?)
	runtime/proc.go:435 +0xce fp=0xc0001a16a0 sp=0xc0001a1680 pc=0x563186b69d4e
runtime.selectgo(0xc0001a1a10, 0xc000057870, 0x1f400?, 0x0, 0x4?, 0x1)
	runtime/select.go:351 +0x837 fp=0xc0001a17d8 sp=0xc0001a16a0 pc=0x563186b483b7
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc000409680, {0x563187edd288, 0xc000272a80}, 0xc0001b1540)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:680 +0xb65 fp=0xc0001a1ac0 sp=0xc0001a17d8 pc=0x56318705d3c5
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x563187edd288?, 0xc000272a80?}, 0xc0001a1b40?)
	<autogenerated>:1 +0x36 fp=0xc0001a1af0 sp=0xc0001a1ac0 pc=0x563187060796
net/http.HandlerFunc.ServeHTTP(0xc00041f200?, {0x563187edd288?, 0xc000272a80?}, 0xc0001a1b60?)
	net/http/server.go:2294 +0x29 fp=0xc0001a1b18 sp=0xc0001a1af0 pc=0x563186e67c49
net/http.(*ServeMux).ServeHTTP(0x563186b0e265?, {0x563187edd288, 0xc000272a80}, 0xc0001b1540)
	net/http/server.go:2822 +0x1c4 fp=0xc0001a1b68 sp=0xc0001a1b18 pc=0x563186e69b44
net/http.serverHandler.ServeHTTP({0x563187ed98d0?}, {0x563187edd288?, 0xc000272a80?}, 0x1?)
	net/http/server.go:3301 +0x8e fp=0xc0001a1b98 sp=0xc0001a1b68 pc=0x563186e875ce
net/http.(*conn).serve(0xc00047a3f0, {0x563187edf518, 0xc0003fd320})
	net/http/server.go:2102 +0x625 fp=0xc0001a1fb8 sp=0xc0001a1b98 pc=0x563186e66145
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3454 +0x28 fp=0xc0001a1fe0 sp=0xc0001a1fb8 pc=0x563186e6ba08
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0001a1fe8 sp=0xc0001a1fe0 pc=0x563186b71481
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3454 +0x485

rax 0x0
rbx 0x6
rcx 0x149c0207352f
rdx 0x0
rdi 0x2
rsi 0x7ffdff767430
rbp 0x149b5f7140e5
rsp 0x7ffdff767430
r8 0x0
r9 0x7ffdff767430
r10 0x8
r11 0x246
r12 0x149b5f714648
r13 0x4d
r14 0x149bbac3f4f8
r15 0x563189ed8420
rip 0x149c0207352f
rflags 0x246
cs 0x33
fs 0x0
gs 0x0

time=2025-08-15T16:26:26.417-07:00 level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:46171/completion\": EOF"
[GIN] 2025/08/15 - 16:26:26 | 500 | 3m34s | 127.0.0.1 | POST "/api/chat"
time=2025-08-15T16:26:26.556-07:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 2"
time=2025-08-15T16:26:32.797-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.116413847 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216894 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:33.760-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=6.080232927 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216894 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:34.759-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=7.078477012 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=216894 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:35.803-07:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 gpu=GPU-2691ebd7-3aea-1611-a8d3-12b456a33b86 parallel=1 available=149467627520 required="113.2 GiB"
time=2025-08-15T16:26:36.979-07:00 level=INFO source=server.go:135 msg="system memory" total="1510.4 GiB" free="1497.8 GiB" free_swap="0 B"
time=2025-08-15T16:26:36.979-07:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split="" memory.available="[139.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="113.2 GiB" memory.required.partial="113.2 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[113.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-15T16:26:37.016-07:00 level=INFO source=server.go:438 msg="starting llama server" cmd="venv/bin/ollama runner --ollama-engine --model .ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 37 --threads 128 --parallel 1 --port 41021"
time=2025-08-15T16:26:37.020-07:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-15T16:26:37.020-07:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-15T16:26:37.020-07:00 level=WARN source=server.go:605 msg="client connection closed before server finished loading, aborting load"
time=2025-08-15T16:26:37.070-07:00 level=ERROR source=sched.go:487 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
time=2025-08-15T16:26:37.084-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-15T16:26:37.084-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:41021"
[GIN] 2025/08/15 - 16:26:37 | 499 | 9.468164032s | 127.0.0.1 | POST "/api/chat"
time=2025-08-15T16:26:37.163-07:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H200 NVL, compute capability 9.0, VMM: yes
load_backend: loaded CUDA backend from venv/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from venv/lib/ollama/libggml-cpu-icelake.so
time=2025-08-15T16:26:37.987-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="59.7 GiB"
time=2025-08-15T16:26:38.313-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-15T16:26:38.340-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="31.5 GiB"
time=2025-08-15T16:26:38.340-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-15T16:26:42.884-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.814082541 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=217328 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:43.923-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=6.853227123 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=217328 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-08-15T16:26:44.925-07:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=7.855429753 runner.size="113.2 GiB" runner.vram="113.2 GiB" runner.parallel=1 runner.pid=217328 runner.model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
[GIN] 2025/08/15 - 16:27:26 | 200 | 64.428µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/08/15 - 16:27:27 | 200 | 75.000746ms | 127.0.0.1 | POST "/api/create"
time=2025-08-15T16:29:22.326-07:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 gpu=GPU-2691ebd7-3aea-1611-a8d3-12b456a33b86 parallel=1 available=149467627520 required="102.0 GiB"
time=2025-08-15T16:29:23.295-07:00 level=INFO source=server.go:135 msg="system memory" total="1510.4 GiB" free="1496.3 GiB" free_swap="0 B"
time=2025-08-15T16:29:23.295-07:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split="" memory.available="[139.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="102.0 GiB" memory.required.partial="102.0 GiB" memory.required.kv="3.6 GiB" memory.required.allocations="[102.0 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="36.6 GiB" memory.graph.partial="36.6 GiB"
time=2025-08-15T16:29:23.333-07:00 level=INFO source=server.go:438 msg="starting llama server" cmd="venv/bin/ollama runner --ollama-engine --model .ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 100000 --batch-size 512 --n-gpu-layers 37 --threads 128 --parallel 1 --port 34591"
time=2025-08-15T16:29:23.337-07:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-15T16:29:23.337-07:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-15T16:29:23.338-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-15T16:29:23.373-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-15T16:29:23.374-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:34591"
time=2025-08-15T16:29:23.411-07:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
time=2025-08-15T16:29:23.589-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-15T16:29:24.040-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-15T16:29:24.291-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H200 NVL, compute capability 9.0, VMM: yes
load_backend: loaded CUDA backend from venv/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from venv/lib/ollama/libggml-cpu-icelake.so
time=2025-08-15T16:29:25.282-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="59.7 GiB"
time=2025-08-15T16:29:25.443-07:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-15T16:29:25.451-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="24.6 GiB"
time=2025-08-15T16:29:25.451-07:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-15T16:29:33.827-07:00 level=WARN source=server.go:605 msg="client connection closed before server finished loading, aborting load"
time=2025-08-15T16:29:33.827-07:00 level=ERROR source=sched.go:487 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
[GIN] 2025/08/15 - 16:29:33 | 499 | 12.820856779s | 127.0.0.1 | POST "/api/chat"
```

### OS

Linux

### GPU

Nvidia

### CPU

AMD

### Ollama version

0.11.4
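A note on the error string for anyone triaging this: `invalid configuration argument` (`cudaErrorInvalidConfiguration`) is what the CUDA runtime returns when a kernel launch exceeds a hardware launch limit, and the y and z dimensions of a launch grid are capped at 65535 blocks on current NVIDIA GPUs, including compute capability 9.0. A `--ctx-size` of 128000 is well past that cap, so any kernel that maps one block per row or token onto `grid.y` would fail exactly as the `SCALE failed` line in the log shows. The sketch below is a hypothetical standalone repro of that failure mode, not ggml's actual SCALE kernel; `scale_rows` and the one-block-per-row layout are assumptions for illustration:

```
#include <cstdio>
#include <cuda_runtime.h>

// Trivial per-row scale; the interesting part is the launch geometry,
// not the kernel body. (Hypothetical kernel, not ggml's implementation.)
__global__ void scale_rows(float *x, float v, int ncols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < ncols) x[(size_t)blockIdx.y * ncols + col] *= v;
}

int main() {
    const int ncols = 64;
    const size_t nrows = 70000; // > 65535, the grid.y hardware cap
    float *x;
    cudaMalloc(&x, sizeof(float) * ncols * nrows);

    dim3 block(ncols);
    dim3 ok_grid(1, 65535);            // largest legal grid.y
    dim3 bad_grid(1, (unsigned)nrows); // exceeds the cap

    scale_rows<<<ok_grid, block>>>(x, 2.0f, ncols);
    printf("65535 rows: %s\n", cudaGetErrorString(cudaGetLastError()));

    scale_rows<<<bad_grid, block>>>(x, 2.0f, ncols);
    // Prints "invalid configuration argument" -- the same string as the
    // crash in the log above.
    printf("70000 rows: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(x);
    return 0;
}
```

The cap itself can be confirmed with `cudaGetDeviceProperties`: `maxGridSize[1]` and `maxGridSize[2]` report 65535 on H100/H200-class parts, while `maxGridSize[0]` is 2^31 - 1, which is why kernels that flatten their work onto `grid.x` are unaffected at these sizes.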
GiteaMirror added the bug label 2026-04-29 05:55:54 -05:00

@jsearcy1 commented on GitHub (Aug 25, 2025):

Looks like this runs fine in 0.11.6.

Reference: github-starred/ollama#54430