[GH-ISSUE #10041] gemma EOF error on image input due to improper memory management #32346

Closed
opened 2026-04-22 13:31:37 -05:00 by GiteaMirror · 15 comments

Originally created by @Master-Pr0grammer on GitHub (Mar 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10041

What is the issue?

Description:

When running Gemma 3 with image inputs on some systems, the model crashes with an EOF error due to improper memory management.

Setup:

I am running gemma3:12b on a system with a GTX 1080 Ti and a GTX 1050 Ti. When loading the model, Ollama splits it across the two GPUs (see the attached nvidia-smi file for VRAM usage). As you can see, I still have a significant amount of VRAM left on the 1080 Ti (~4.5 GB), but only ~500 MB on the 1050 Ti.

Possible Explanation:

I am not super familiar with Ollama's backend, but my guess is that this has to do with the GPU split not properly pre-allocating memory, as evidenced by this line in the error log:

Mar 29 11:58:38 watson ollama[1408032]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1195.28 MiB on device 1: cudaMalloc failed: out of memory

It is trying to allocate more memory after the model has already been loaded, on the device with less remaining memory (the 1050 Ti). Again, I don't know the backend internals, but I feel the memory for the context length should be pre-allocated at load time; adding images would then simply consume more of that pre-allocated context instead of triggering new allocations after the fact, which would prevent these kinds of crashes.
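
To illustrate the idea (a minimal sketch only, based on my assumptions above, not how Ollama's ggml backend actually works): if the worst-case context buffer were reserved once at load time, an oversized multimodal request could be rejected cleanly instead of crashing on a late cudaMalloc.

# Hypothetical sketch of the suggested strategy; this is NOT Ollama code.
# Reserve the worst-case context buffer while loading the model, then serve
# requests out of that reservation instead of allocating per request.
class PreallocatedContext:
    def __init__(self, ctx_size_tokens, tokens_per_image=256):
        # One-time reservation, made while loading, when a failure can still
        # be reported cleanly to the scheduler (sizes taken from the log
        # below: --ctx-size 8192, gemma3.mm_tokens_per_image default=256)
        self.capacity = ctx_size_tokens
        self.tokens_per_image = tokens_per_image
        self.used = 0

    def admit(self, text_tokens, num_images):
        # Images consume slots from the existing reservation; no new GPU
        # memory is allocated at request time.
        needed = text_tokens + num_images * self.tokens_per_image
        if self.used + needed > self.capacity:
            # Fail fast with an error instead of segfaulting mid-inference.
            raise RuntimeError(
                f"need {needed} tokens, only {self.capacity - self.used} free")
        self.used += needed

ctx = PreallocatedContext(ctx_size_tokens=8192)
ctx.admit(text_tokens=30, num_images=1)  # 30 + 256 = 286 tokens, fits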

Reproducibility:

Python script I used to reproduce the error:

import ollama

# Connect to the Ollama server on my LAN
client = ollama.Client(host='http://192.168.50.221:11434/')

# Chat request with an attached image; the client reads and encodes the file
response = client.chat(
    model='gemma3:12b',
    messages=[
        {'role': 'system', 'content': 'You are a helpful AI assistant'},
        {'role': 'user', 'content': 'what is in this image?', 'images': ['Screenshot 2025-03-13 003328.png']}
    ]
)

print(response.message.content)
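
In case it is useful, roughly the same request can be sent to the REST API without the ollama package (a sketch; the /api/chat endpoint expects images as base64-encoded strings in the images array, which the client library normally handles for you):

import base64
import json
import urllib.request

# Base64-encode the screenshot, as /api/chat expects inline image data
with open('Screenshot 2025-03-13 003328.png', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('ascii')

payload = {
    'model': 'gemma3:12b',
    'stream': False,  # return a single JSON object instead of a stream
    'messages': [
        {'role': 'system', 'content': 'You are a helpful AI assistant'},
        {'role': 'user', 'content': 'what is in this image?', 'images': [image_b64]},
    ],
}

req = urllib.request.Request(
    'http://192.168.50.221:11434/api/chat',
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())['message']['content'])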

Error Output:

Traceback (most recent call last):
  File "\Desktop\test.py", line 4, in <module>
    response = client.chat(
               ^^^^^^^^^^^^
  File "...\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ollama\_client.py", line 333, in chat
    return self._request(
           ^^^^^^^^^^^^^^
  File "...\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ollama\_client.py", line 178, in _request
    return cls(**self._request_raw(*args, **kwargs).json())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ollama\_client.py", line 122, in _request_raw
    raise ResponseError(e.response.text, e.response.status_code) from None
ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:33823/completion": EOF (status code: 500)

The image I used was just a screenshot of my screen at standard 1080p resolution.
I hope this helps, but I doubt you will be able to reproduce this error without a system that splits the model across GPUs with one or two of them low on VRAM.

Relevant Files:

nvidia-smi.txt (https://github.com/user-attachments/files/19521995/nvidia-smi.txt)

Relevant log output

Mar 29 11:58:35 watson ollama[1408032]: time=2025-03-29T11:58:35.421-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.227315301 model=/usr/share/ollama/.ollama/models/blobs/sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3
Mar 29 11:58:35 watson ollama[1408032]: time=2025-03-29T11:58:35.672-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.477852175 model=/usr/share/ollama/.ollama/models/blobs/sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3
Mar 29 11:58:35 watson ollama[1408032]: time=2025-03-29T11:58:35.891-04:00 level=INFO source=sched.go:731 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3 library=cuda parallel=4 required="14.3 GiB"
Mar 29 11:58:35 watson ollama[1408032]: time=2025-03-29T11:58:35.951-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.756531494 model=/usr/share/ollama/.ollama/models/blobs/sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.135-04:00 level=INFO source=server.go:105 msg="system memory" total="15.6 GiB" free="14.2 GiB" free_swap="2.9 GiB"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.138-04:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split=38,11 memory.available="[10.6 GiB 3.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="14.3 GiB" memory.required.partial="14.3 GiB" memory.required.kv="1.9 GiB" memory.required.allocations="[10.5 GiB 3.7 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.261-04:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.268-04:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.271-04:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.280-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.280-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.280-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.280-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.280-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.280-04:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3 --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 4 --no-mmap --parallel 4 --tensor-split 38,11 --port 33823"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.281-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.281-04:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.281-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.294-04:00 level=INFO source=runner.go:765 msg="starting ollama engine"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.297-04:00 level=INFO source=runner.go:828 msg="Server listening on 127.0.0.1:33823"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.429-04:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.429-04:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.429-04:00 level=INFO source=ggml.go:69 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=36
Mar 29 11:58:36 watson ollama[1408032]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 29 11:58:36 watson ollama[1408032]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 29 11:58:36 watson ollama[1408032]: ggml_cuda_init: found 2 CUDA devices:
Mar 29 11:58:36 watson ollama[1408032]:   Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Mar 29 11:58:36 watson ollama[1408032]:   Device 1: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.532-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Mar 29 11:58:36 watson ollama[1408032]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 29 11:58:36 watson ollama[1408032]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.560-04:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.631-04:00 level=INFO source=ggml.go:291 msg="model weights" buffer=CUDA1 size="2.9 GiB"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.631-04:00 level=INFO source=ggml.go:291 msg="model weights" buffer=CPU size="787.5 MiB"
Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.631-04:00 level=INFO source=ggml.go:291 msg="model weights" buffer=CUDA0 size="4.7 GiB"
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.090-04:00 level=INFO source=ggml.go:383 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.090-04:00 level=INFO source=ggml.go:383 msg="compute graph" backend=CUDA1 buffer_type=CUDA1
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.090-04:00 level=INFO source=ggml.go:383 msg="compute graph" backend=CPU buffer_type=CUDA_Host
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.091-04:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.093-04:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.096-04:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.103-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.103-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.103-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.103-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.103-04:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.291-04:00 level=INFO source=server.go:619 msg="llama runner started in 2.01 seconds"
Mar 29 11:58:38 watson ollama[1408032]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1195.28 MiB on device 1: cudaMalloc failed: out of memory
Mar 29 11:58:38 watson ollama[1408032]: ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1253346304
Mar 29 11:58:38 watson ollama[1408032]: SIGSEGV: segmentation violation
Mar 29 11:58:38 watson ollama[1408032]: PC=0x59fdcdf550c0 m=11 sigcode=1 addr=0x58
Mar 29 11:58:38 watson ollama[1408032]: signal arrived during cgo execution
Mar 29 11:58:38 watson ollama[1408032]: goroutine 20 gp=0xc0001c7880 m=11 mp=0xc000307808 [syscall]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.cgocall(0x59fdcdfa90d0, 0xc000229aa8)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/cgocall.go:167 +0x4b fp=0xc000229a80 sp=0xc000229a48 pc=0x59fdcd17496b
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x7a2d38003a70, 0x7a303a400d50)
Mar 29 11:58:38 watson ollama[1408032]:         _cgo_gotypes.go:485 +0x4a fp=0xc000229aa8 sp=0xc000229a80 pc=0x59fdcd56e9aa
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/ml/backend/ggml.Context.Compute.func1(...)
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/ml/backend/ggml/ggml.go:524
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/ml/backend/ggml.Context.Compute({0xc000412200, 0x7a2d38003940, 0x7a303a400d50, 0x0, 0x2000}, {0xc0031266f0, 0x1, 0xc00258a120?})
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/ml/backend/ggml/ggml.go:524 +0xbd fp=0xc000229b38 sp=0xc000229aa8 pc=0x59fdcd5774fd
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/ml/backend/ggml.(*Context).Compute(0xc0030f44b0?, {0xc0031266f0?, 0xc00258a090?, 0x1?})
Mar 29 11:58:38 watson ollama[1408032]:         <autogenerated>:1 +0x72 fp=0xc000229bb0 sp=0xc000229b38 pc=0x59fdcd57cf72
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/model.Forward({0x59fdce45c600, 0xc0030f44b0}, {0x59fdce453c90, 0xc00030a0e0}, {0xc0030c1000, 0x11e, 0x200}, {{0x59fdce464c10, 0xc00258a0a8}, {0xc00258a090, ...}, ...})
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/model/model.go:312 +0x2b8 fp=0xc000229c90 sp=0xc000229bb0 pc=0x59fdcd5a41f8
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner/ollamarunner.(*Server).processBatch(0xc00053b9e0)
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:424 +0x3fe fp=0xc000229f98 sp=0xc000229c90 pc=0x59fdcd62775e
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc00053b9e0, {0x59fdce454fc0, 0xc000531db0})
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:336 +0x4e fp=0xc000229fb8 sp=0xc000229f98 pc=0x59fdcd62730e
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap2()
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:805 +0x28 fp=0xc000229fe0 sp=0xc000229fb8 pc=0x59fdcd62b708
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000229fe8 sp=0xc000229fe0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:805 +0xb37
Mar 29 11:58:38 watson ollama[1408032]: goroutine 1 gp=0xc000002380 m=nil [IO wait]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000225628 sp=0xc000225608 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.netpollblock(0xc000225678?, 0xcd111426?, 0xfd?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/netpoll.go:575 +0xf7 fp=0xc000225660 sp=0xc000225628 pc=0x59fdcd13ca57
Mar 29 11:58:38 watson ollama[1408032]: internal/poll.runtime_pollWait(0x7a30a2a6deb0, 0x72)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/netpoll.go:351 +0x85 fp=0xc000225680 sp=0xc000225660 pc=0x59fdcd176e85
Mar 29 11:58:38 watson ollama[1408032]: internal/poll.(*pollDesc).wait(0xc00050f300?, 0x900000036?, 0x0)
Mar 29 11:58:38 watson ollama[1408032]:         internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0002256a8 sp=0xc000225680 pc=0x59fdcd1fe307
Mar 29 11:58:38 watson ollama[1408032]: internal/poll.(*pollDesc).waitRead(...)
Mar 29 11:58:38 watson ollama[1408032]:         internal/poll/fd_poll_runtime.go:89
Mar 29 11:58:38 watson ollama[1408032]: internal/poll.(*FD).Accept(0xc00050f300)
Mar 29 11:58:38 watson ollama[1408032]:         internal/poll/fd_unix.go:620 +0x295 fp=0xc000225750 sp=0xc0002256a8 pc=0x59fdcd2036d5
Mar 29 11:58:38 watson ollama[1408032]: net.(*netFD).accept(0xc00050f300)
Mar 29 11:58:38 watson ollama[1408032]:         net/fd_unix.go:172 +0x29 fp=0xc000225808 sp=0xc000225750 pc=0x59fdcd2764e9
Mar 29 11:58:38 watson ollama[1408032]: net.(*TCPListener).accept(0xc00053c000)
Mar 29 11:58:38 watson ollama[1408032]:         net/tcpsock_posix.go:159 +0x1b fp=0xc000225858 sp=0xc000225808 pc=0x59fdcd28be9b
Mar 29 11:58:38 watson ollama[1408032]: net.(*TCPListener).Accept(0xc00053c000)
Mar 29 11:58:38 watson ollama[1408032]:         net/tcpsock.go:380 +0x30 fp=0xc000225888 sp=0xc000225858 pc=0x59fdcd28ad50
Mar 29 11:58:38 watson ollama[1408032]: net/http.(*onceCloseListener).Accept(0xc000126360?)
Mar 29 11:58:38 watson ollama[1408032]:         <autogenerated>:1 +0x24 fp=0xc0002258a0 sp=0xc000225888 pc=0x59fdcd4a2384
Mar 29 11:58:38 watson ollama[1408032]: net/http.(*Server).Serve(0xc0001e8e00, {0x59fdce452cf8, 0xc00053c000})
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:3424 +0x30c fp=0xc0002259d0 sp=0xc0002258a0 pc=0x59fdcd479c4c
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner/ollamarunner.Execute({0xc000034170, 0x11, 0x11})
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:829 +0xec9 fp=0xc000225d08 sp=0xc0002259d0 pc=0x59fdcd62b469
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner.Execute({0xc000034150?, 0x0?, 0x0?})
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/runner/runner.go:20 +0xc9 fp=0xc000225d30 sp=0xc000225d08 pc=0x59fdcd62c0e9
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/cmd.NewCLI.func2(0xc0001e9200?, {0x59fdcdfc4055?, 0x4?, 0x59fdcdfc4059?})
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/cmd/cmd.go:1329 +0x45 fp=0xc000225d58 sp=0xc000225d30 pc=0x59fdcdd79b25
Mar 29 11:58:38 watson ollama[1408032]: github.com/spf13/cobra.(*Command).execute(0xc000128f08, {0xc00053b7a0, 0x12, 0x12})
Mar 29 11:58:38 watson ollama[1408032]:         github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc000225e78 sp=0xc000225d58 pc=0x59fdcd2efb3c
Mar 29 11:58:38 watson ollama[1408032]: github.com/spf13/cobra.(*Command).ExecuteC(0xc0004a2f08)
Mar 29 11:58:38 watson ollama[1408032]:         github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc000225f30 sp=0xc000225e78 pc=0x59fdcd2f0385
Mar 29 11:58:38 watson ollama[1408032]: github.com/spf13/cobra.(*Command).Execute(...)
Mar 29 11:58:38 watson ollama[1408032]:         github.com/spf13/cobra@v1.7.0/command.go:992
Mar 29 11:58:38 watson ollama[1408032]: github.com/spf13/cobra.(*Command).ExecuteContext(...)
Mar 29 11:58:38 watson ollama[1408032]:         github.com/spf13/cobra@v1.7.0/command.go:985
Mar 29 11:58:38 watson ollama[1408032]: main.main()
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/main.go:12 +0x4d fp=0xc000225f50 sp=0xc000225f30 pc=0x59fdcdd79e8d
Mar 29 11:58:38 watson ollama[1408032]: runtime.main()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:283 +0x29d fp=0xc000225fe0 sp=0xc000225f50 pc=0x59fdcd14405d
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000225fe8 sp=0xc000225fe0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000064fa8 sp=0xc000064f88 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.goparkunlock(...)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:441
Mar 29 11:58:38 watson ollama[1408032]: runtime.forcegchelper()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:348 +0xb8 fp=0xc000064fe0 sp=0xc000064fa8 pc=0x59fdcd144398
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000064fe8 sp=0xc000064fe0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by runtime.init.7 in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:336 +0x1a
Mar 29 11:58:38 watson ollama[1408032]: goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000065780 sp=0xc000065760 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.goparkunlock(...)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:441
Mar 29 11:58:38 watson ollama[1408032]: runtime.bgsweep(0xc00007e000)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgcsweep.go:316 +0xdf fp=0xc0000657c8 sp=0xc000065780 pc=0x59fdcd12ea5f
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcenable.gowrap1()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:204 +0x25 fp=0xc0000657e0 sp=0xc0000657c8 pc=0x59fdcd122e45
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000657e8 sp=0xc0000657e0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcenable in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:204 +0x66
Mar 29 11:58:38 watson ollama[1408032]: goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x1285dfdc?, 0x127e78d7?, 0x0?, 0x0?, 0x0?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000065f78 sp=0xc000065f58 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.goparkunlock(...)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:441
Mar 29 11:58:38 watson ollama[1408032]: runtime.(*scavengerState).park(0x59fdcecbb1c0)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgcscavenge.go:425 +0x49 fp=0xc000065fa8 sp=0xc000065f78 pc=0x59fdcd12c4a9
Mar 29 11:58:38 watson ollama[1408032]: runtime.bgscavenge(0xc00007e000)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgcscavenge.go:658 +0x59 fp=0xc000065fc8 sp=0xc000065fa8 pc=0x59fdcd12ca39
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcenable.gowrap2()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:205 +0x25 fp=0xc000065fe0 sp=0xc000065fc8 pc=0x59fdcd122de5
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000065fe8 sp=0xc000065fe0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcenable in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:205 +0xa5
Mar 29 11:58:38 watson ollama[1408032]: goroutine 5 gp=0xc000003dc0 m=nil [finalizer wait]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000064688?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000064630 sp=0xc000064610 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.runfinq()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mfinal.go:196 +0x107 fp=0xc0000647e0 sp=0xc000064630 pc=0x59fdcd121e07
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000647e8 sp=0xc0000647e0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by runtime.createfing in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mfinal.go:166 +0x3d
Mar 29 11:58:38 watson ollama[1408032]: goroutine 6 gp=0xc0001c68c0 m=nil [chan receive]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0xc0002197c0?, 0xc0030fb0e0?, 0x60?, 0x67?, 0x59fdcd25d228?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000066718 sp=0xc0000666f8 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.chanrecv(0xc0000423f0, 0x0, 0x1)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/chan.go:664 +0x445 fp=0xc000066790 sp=0xc000066718 pc=0x59fdcd114005
Mar 29 11:58:38 watson ollama[1408032]: runtime.chanrecv1(0x0?, 0x0?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/chan.go:506 +0x12 fp=0xc0000667b8 sp=0xc000066790 pc=0x59fdcd113b92
Mar 29 11:58:38 watson ollama[1408032]: runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1796
Mar 29 11:58:38 watson ollama[1408032]: runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1799 +0x2f fp=0xc0000667e0 sp=0xc0000667b8 pc=0x59fdcd125fef
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000667e8 sp=0xc0000667e0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by unique.runtime_registerUniqueMapCleanup in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1794 +0x85
Mar 29 11:58:38 watson ollama[1408032]: goroutine 7 gp=0xc0001c7180 m=nil [GC worker (idle)]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x22a48fd53c1d7?, 0x3?, 0xf3?, 0x4f?, 0x0?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000066f38 sp=0xc000066f18 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkWorker(0xc0000439d0)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1423 +0xe9 fp=0xc000066fc8 sp=0xc000066f38 pc=0x59fdcd125309
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkStartWorkers.gowrap1()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1339 +0x25 fp=0xc000066fe0 sp=0xc000066fc8 pc=0x59fdcd1251e5
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000066fe8 sp=0xc000066fe0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1339 +0x105
Mar 29 11:58:38 watson ollama[1408032]: goroutine 8 gp=0xc0001c7340 m=nil [GC worker (idle)]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x22a4912efa1bd?, 0x3?, 0x8f?, 0xd0?, 0x0?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000067738 sp=0xc000067718 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkWorker(0xc0000439d0)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1423 +0xe9 fp=0xc0000677c8 sp=0xc000067738 pc=0x59fdcd125309
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkStartWorkers.gowrap1()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1339 +0x25 fp=0xc0000677e0 sp=0xc0000677c8 pc=0x59fdcd1251e5
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000677e8 sp=0xc0000677e0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1339 +0x105
Mar 29 11:58:38 watson ollama[1408032]: goroutine 9 gp=0xc0001c7500 m=nil [GC worker (idle)]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x22a4912efbbf0?, 0x3?, 0x3f?, 0xc4?, 0x0?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000067f38 sp=0xc000067f18 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkWorker(0xc0000439d0)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1423 +0xe9 fp=0xc000067fc8 sp=0xc000067f38 pc=0x59fdcd125309
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkStartWorkers.gowrap1()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1339 +0x25 fp=0xc000067fe0 sp=0xc000067fc8 pc=0x59fdcd1251e5
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000067fe8 sp=0xc000067fe0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1339 +0x105
Mar 29 11:58:38 watson ollama[1408032]: goroutine 18 gp=0xc000102380 m=nil [GC worker (idle)]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x22a4912efa23c?, 0x1?, 0xc5?, 0xfb?, 0x0?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000060738 sp=0xc000060718 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkWorker(0xc0000439d0)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1423 +0xe9 fp=0xc0000607c8 sp=0xc000060738 pc=0x59fdcd125309
Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkStartWorkers.gowrap1()
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1339 +0x25 fp=0xc0000607e0 sp=0xc0000607c8 pc=0x59fdcd1251e5
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000607e8 sp=0xc0000607e0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcBgMarkStartWorkers in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         runtime/mgc.go:1339 +0x105
Mar 29 11:58:38 watson ollama[1408032]: goroutine 10 gp=0xc0001c6700 m=nil [select]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0xc000049a08?, 0x2?, 0x0?, 0xc6?, 0xc000049864?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc000049678 sp=0xc000049658 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.selectgo(0xc000049a08, 0xc000049860, 0x11e?, 0x0, 0x4?, 0x1)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/select.go:351 +0x837 fp=0xc0000497b0 sp=0xc000049678 pc=0x59fdcd156557
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc00053b9e0, {0x59fdce452ed8, 0xc0052440e0}, 0xc003021e00)
Mar 29 11:58:38 watson ollama[1408032]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:621 +0xae5 fp=0xc000049ac0 sp=0xc0000497b0 pc=0x59fdcd6298c5
Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x59fdce452ed8?, 0xc0052440e0?}, 0xc000223b40?)
Mar 29 11:58:38 watson ollama[1408032]:         <autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x59fdcd62bf56
Mar 29 11:58:38 watson ollama[1408032]: net/http.HandlerFunc.ServeHTTP(0xc000550000?, {0x59fdce452ed8?, 0xc0052440e0?}, 0xc000223b60?)
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x59fdcd476289
Mar 29 11:58:38 watson ollama[1408032]: net/http.(*ServeMux).ServeHTTP(0x59fdcd11c325?, {0x59fdce452ed8, 0xc0052440e0}, 0xc003021e00)
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x59fdcd478184
Mar 29 11:58:38 watson ollama[1408032]: net/http.serverHandler.ServeHTTP({0x59fdce44f570?}, {0x59fdce452ed8?, 0xc0052440e0?}, 0x1?)
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x59fdcd495c0e
Mar 29 11:58:38 watson ollama[1408032]: net/http.(*conn).serve(0xc000126360, {0x59fdce454f88, 0xc0000ebd40})
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x59fdcd474785
Mar 29 11:58:38 watson ollama[1408032]: net/http.(*Server).Serve.gowrap3()
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x59fdcd47a048
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by net/http.(*Server).Serve in goroutine 1
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:3454 +0x485
Mar 29 11:58:38 watson ollama[1408032]: goroutine 1071 gp=0xc0001c7c00 m=nil [IO wait]:
Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x0?, 0xc000061768?, 0x93?, 0x5f?, 0xb?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/proc.go:435 +0xce fp=0xc0000615d8 sp=0xc0000615b8 pc=0x59fdcd177c6e
Mar 29 11:58:38 watson ollama[1408032]: runtime.netpollblock(0x59fdcd19b0f8?, 0xcd111426?, 0xfd?)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/netpoll.go:575 +0xf7 fp=0xc000061610 sp=0xc0000615d8 pc=0x59fdcd13ca57
Mar 29 11:58:38 watson ollama[1408032]: internal/poll.runtime_pollWait(0x7a30a2a6dd98, 0x72)
Mar 29 11:58:38 watson ollama[1408032]:         runtime/netpoll.go:351 +0x85 fp=0xc000061630 sp=0xc000061610 pc=0x59fdcd176e85
Mar 29 11:58:38 watson ollama[1408032]: internal/poll.(*pollDesc).wait(0xc00050e100?, 0xc0002120a1?, 0x0)
Mar 29 11:58:38 watson ollama[1408032]:         internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000061658 sp=0xc000061630 pc=0x59fdcd1fe307
Mar 29 11:58:38 watson ollama[1408032]: internal/poll.(*pollDesc).waitRead(...)
Mar 29 11:58:38 watson ollama[1408032]:         internal/poll/fd_poll_runtime.go:89
Mar 29 11:58:38 watson ollama[1408032]: internal/poll.(*FD).Read(0xc00050e100, {0xc0002120a1, 0x1, 0x1})
Mar 29 11:58:38 watson ollama[1408032]:         internal/poll/fd_unix.go:165 +0x27a fp=0xc0000616f0 sp=0xc000061658 pc=0x59fdcd1ff5fa
Mar 29 11:58:38 watson ollama[1408032]: net.(*netFD).Read(0xc00050e100, {0xc0002120a1?, 0x59fdcd56d689?, 0xc000061770?})
Mar 29 11:58:38 watson ollama[1408032]:         net/fd_posix.go:55 +0x25 fp=0xc000061738 sp=0xc0000616f0 pc=0x59fdcd274545
Mar 29 11:58:38 watson ollama[1408032]: net.(*conn).Read(0xc000068208, {0xc0002120a1?, 0xc0030471c0?, 0x59fdcd56d640?})
Mar 29 11:58:38 watson ollama[1408032]:         net/net.go:194 +0x45 fp=0xc000061780 sp=0xc000061738 pc=0x59fdcd282905
Mar 29 11:58:38 watson ollama[1408032]: net/http.(*connReader).backgroundRead(0xc000212090)
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:690 +0x37 fp=0xc0000617c8 sp=0xc000061780 pc=0x59fdcd46e657
Mar 29 11:58:38 watson ollama[1408032]: net/http.(*connReader).startBackgroundRead.gowrap2()
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:686 +0x25 fp=0xc0000617e0 sp=0xc0000617c8 pc=0x59fdcd46e585
Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({})
Mar 29 11:58:38 watson ollama[1408032]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000617e8 sp=0xc0000617e0 pc=0x59fdcd17f3a1
Mar 29 11:58:38 watson ollama[1408032]: created by net/http.(*connReader).startBackgroundRead in goroutine 10
Mar 29 11:58:38 watson ollama[1408032]:         net/http/server.go:686 +0xb6
Mar 29 11:58:38 watson ollama[1408032]: rax    0x7a2d3837e8e0
Mar 29 11:58:38 watson ollama[1408032]: rbx    0x7a2d3837e850
Mar 29 11:58:38 watson ollama[1408032]: rcx    0x3
Mar 29 11:58:38 watson ollama[1408032]: rdx    0x7a2d38521b50
Mar 29 11:58:38 watson ollama[1408032]: rdi    0x0
Mar 29 11:58:38 watson ollama[1408032]: rsi    0x7a2fa0a00030
Mar 29 11:58:38 watson ollama[1408032]: rbp    0x7a2d38521b48
Mar 29 11:58:38 watson ollama[1408032]: rsp    0x7a303b3ffc48
Mar 29 11:58:38 watson ollama[1408032]: r8     0x4
Mar 29 11:58:38 watson ollama[1408032]: r9     0xc000068048
Mar 29 11:58:38 watson ollama[1408032]: r10    0x1
Mar 29 11:58:38 watson ollama[1408032]: r11    0x216
Mar 29 11:58:38 watson ollama[1408032]: r12    0x1
Mar 29 11:58:38 watson ollama[1408032]: r13    0x7a2d38003bc8
Mar 29 11:58:38 watson ollama[1408032]: r14    0xc2f
Mar 29 11:58:38 watson ollama[1408032]: r15    0x7a2d3837e850
Mar 29 11:58:38 watson ollama[1408032]: rip    0x59fdcdf550c0
Mar 29 11:58:38 watson ollama[1408032]: rflags 0x10206
Mar 29 11:58:38 watson ollama[1408032]: cs     0x33
Mar 29 11:58:38 watson ollama[1408032]: fs     0x0
Mar 29 11:58:38 watson ollama[1408032]: gs     0x0
Mar 29 11:58:38 watson ollama[1408032]: [GIN] 2025/03/29 - 11:58:38 | 500 |  8.841319466s |  192.168.50.221 | POST     "/api/chat"
Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.730-04:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 2"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.6.3

Mar 29 11:58:38 watson ollama[1408032]: runtime/proc.go:435 +0xce fp=0xc000066f38 sp=0xc000066f18 pc=0x59fdcd177c6e Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkWorker(0xc0000439d0) Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1423 +0xe9 fp=0xc000066fc8 sp=0xc000066f38 pc=0x59fdcd125309 Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkStartWorkers.gowrap1() Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1339 +0x25 fp=0xc000066fe0 sp=0xc000066fc8 pc=0x59fdcd1251e5 Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({}) Mar 29 11:58:38 watson ollama[1408032]: runtime/asm_amd64.s:1700 +0x1 fp=0xc000066fe8 sp=0xc000066fe0 pc=0x59fdcd17f3a1 Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcBgMarkStartWorkers in goroutine 1 Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1339 +0x105 Mar 29 11:58:38 watson ollama[1408032]: goroutine 8 gp=0xc0001c7340 m=nil [GC worker (idle)]: Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x22a4912efa1bd?, 0x3?, 0x8f?, 0xd0?, 0x0?) Mar 29 11:58:38 watson ollama[1408032]: runtime/proc.go:435 +0xce fp=0xc000067738 sp=0xc000067718 pc=0x59fdcd177c6e Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkWorker(0xc0000439d0) Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1423 +0xe9 fp=0xc0000677c8 sp=0xc000067738 pc=0x59fdcd125309 Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkStartWorkers.gowrap1() Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1339 +0x25 fp=0xc0000677e0 sp=0xc0000677c8 pc=0x59fdcd1251e5 Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({}) Mar 29 11:58:38 watson ollama[1408032]: runtime/asm_amd64.s:1700 +0x1 fp=0xc0000677e8 sp=0xc0000677e0 pc=0x59fdcd17f3a1 Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcBgMarkStartWorkers in goroutine 1 Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1339 +0x105 Mar 29 11:58:38 watson ollama[1408032]: goroutine 9 gp=0xc0001c7500 m=nil [GC worker (idle)]: Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x22a4912efbbf0?, 0x3?, 0x3f?, 0xc4?, 0x0?) Mar 29 11:58:38 watson ollama[1408032]: runtime/proc.go:435 +0xce fp=0xc000067f38 sp=0xc000067f18 pc=0x59fdcd177c6e Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkWorker(0xc0000439d0) Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1423 +0xe9 fp=0xc000067fc8 sp=0xc000067f38 pc=0x59fdcd125309 Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkStartWorkers.gowrap1() Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1339 +0x25 fp=0xc000067fe0 sp=0xc000067fc8 pc=0x59fdcd1251e5 Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({}) Mar 29 11:58:38 watson ollama[1408032]: runtime/asm_amd64.s:1700 +0x1 fp=0xc000067fe8 sp=0xc000067fe0 pc=0x59fdcd17f3a1 Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcBgMarkStartWorkers in goroutine 1 Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1339 +0x105 Mar 29 11:58:38 watson ollama[1408032]: goroutine 18 gp=0xc000102380 m=nil [GC worker (idle)]: Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x22a4912efa23c?, 0x1?, 0xc5?, 0xfb?, 0x0?) 
Mar 29 11:58:38 watson ollama[1408032]: runtime/proc.go:435 +0xce fp=0xc000060738 sp=0xc000060718 pc=0x59fdcd177c6e Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkWorker(0xc0000439d0) Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1423 +0xe9 fp=0xc0000607c8 sp=0xc000060738 pc=0x59fdcd125309 Mar 29 11:58:38 watson ollama[1408032]: runtime.gcBgMarkStartWorkers.gowrap1() Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1339 +0x25 fp=0xc0000607e0 sp=0xc0000607c8 pc=0x59fdcd1251e5 Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({}) Mar 29 11:58:38 watson ollama[1408032]: runtime/asm_amd64.s:1700 +0x1 fp=0xc0000607e8 sp=0xc0000607e0 pc=0x59fdcd17f3a1 Mar 29 11:58:38 watson ollama[1408032]: created by runtime.gcBgMarkStartWorkers in goroutine 1 Mar 29 11:58:38 watson ollama[1408032]: runtime/mgc.go:1339 +0x105 Mar 29 11:58:38 watson ollama[1408032]: goroutine 10 gp=0xc0001c6700 m=nil [select]: Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0xc000049a08?, 0x2?, 0x0?, 0xc6?, 0xc000049864?) Mar 29 11:58:38 watson ollama[1408032]: runtime/proc.go:435 +0xce fp=0xc000049678 sp=0xc000049658 pc=0x59fdcd177c6e Mar 29 11:58:38 watson ollama[1408032]: runtime.selectgo(0xc000049a08, 0xc000049860, 0x11e?, 0x0, 0x4?, 0x1) Mar 29 11:58:38 watson ollama[1408032]: runtime/select.go:351 +0x837 fp=0xc0000497b0 sp=0xc000049678 pc=0x59fdcd156557 Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc00053b9e0, {0x59fdce452ed8, 0xc0052440e0}, 0xc003021e00) Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner/ollamarunner/runner.go:621 +0xae5 fp=0xc000049ac0 sp=0xc0000497b0 pc=0x59fdcd6298c5 Mar 29 11:58:38 watson ollama[1408032]: github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x59fdce452ed8?, 0xc0052440e0?}, 0xc000223b40?) Mar 29 11:58:38 watson ollama[1408032]: <autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x59fdcd62bf56 Mar 29 11:58:38 watson ollama[1408032]: net/http.HandlerFunc.ServeHTTP(0xc000550000?, {0x59fdce452ed8?, 0xc0052440e0?}, 0xc000223b60?) Mar 29 11:58:38 watson ollama[1408032]: net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x59fdcd476289 Mar 29 11:58:38 watson ollama[1408032]: net/http.(*ServeMux).ServeHTTP(0x59fdcd11c325?, {0x59fdce452ed8, 0xc0052440e0}, 0xc003021e00) Mar 29 11:58:38 watson ollama[1408032]: net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x59fdcd478184 Mar 29 11:58:38 watson ollama[1408032]: net/http.serverHandler.ServeHTTP({0x59fdce44f570?}, {0x59fdce452ed8?, 0xc0052440e0?}, 0x1?) 
Mar 29 11:58:38 watson ollama[1408032]: net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x59fdcd495c0e Mar 29 11:58:38 watson ollama[1408032]: net/http.(*conn).serve(0xc000126360, {0x59fdce454f88, 0xc0000ebd40}) Mar 29 11:58:38 watson ollama[1408032]: net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x59fdcd474785 Mar 29 11:58:38 watson ollama[1408032]: net/http.(*Server).Serve.gowrap3() Mar 29 11:58:38 watson ollama[1408032]: net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x59fdcd47a048 Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({}) Mar 29 11:58:38 watson ollama[1408032]: runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x59fdcd17f3a1 Mar 29 11:58:38 watson ollama[1408032]: created by net/http.(*Server).Serve in goroutine 1 Mar 29 11:58:38 watson ollama[1408032]: net/http/server.go:3454 +0x485 Mar 29 11:58:38 watson ollama[1408032]: goroutine 1071 gp=0xc0001c7c00 m=nil [IO wait]: Mar 29 11:58:38 watson ollama[1408032]: runtime.gopark(0x0?, 0xc000061768?, 0x93?, 0x5f?, 0xb?) Mar 29 11:58:38 watson ollama[1408032]: runtime/proc.go:435 +0xce fp=0xc0000615d8 sp=0xc0000615b8 pc=0x59fdcd177c6e Mar 29 11:58:38 watson ollama[1408032]: runtime.netpollblock(0x59fdcd19b0f8?, 0xcd111426?, 0xfd?) Mar 29 11:58:38 watson ollama[1408032]: runtime/netpoll.go:575 +0xf7 fp=0xc000061610 sp=0xc0000615d8 pc=0x59fdcd13ca57 Mar 29 11:58:38 watson ollama[1408032]: internal/poll.runtime_pollWait(0x7a30a2a6dd98, 0x72) Mar 29 11:58:38 watson ollama[1408032]: runtime/netpoll.go:351 +0x85 fp=0xc000061630 sp=0xc000061610 pc=0x59fdcd176e85 Mar 29 11:58:38 watson ollama[1408032]: internal/poll.(*pollDesc).wait(0xc00050e100?, 0xc0002120a1?, 0x0) Mar 29 11:58:38 watson ollama[1408032]: internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000061658 sp=0xc000061630 pc=0x59fdcd1fe307 Mar 29 11:58:38 watson ollama[1408032]: internal/poll.(*pollDesc).waitRead(...) 
Mar 29 11:58:38 watson ollama[1408032]: internal/poll/fd_poll_runtime.go:89 Mar 29 11:58:38 watson ollama[1408032]: internal/poll.(*FD).Read(0xc00050e100, {0xc0002120a1, 0x1, 0x1}) Mar 29 11:58:38 watson ollama[1408032]: internal/poll/fd_unix.go:165 +0x27a fp=0xc0000616f0 sp=0xc000061658 pc=0x59fdcd1ff5fa Mar 29 11:58:38 watson ollama[1408032]: net.(*netFD).Read(0xc00050e100, {0xc0002120a1?, 0x59fdcd56d689?, 0xc000061770?}) Mar 29 11:58:38 watson ollama[1408032]: net/fd_posix.go:55 +0x25 fp=0xc000061738 sp=0xc0000616f0 pc=0x59fdcd274545 Mar 29 11:58:38 watson ollama[1408032]: net.(*conn).Read(0xc000068208, {0xc0002120a1?, 0xc0030471c0?, 0x59fdcd56d640?}) Mar 29 11:58:38 watson ollama[1408032]: net/net.go:194 +0x45 fp=0xc000061780 sp=0xc000061738 pc=0x59fdcd282905 Mar 29 11:58:38 watson ollama[1408032]: net/http.(*connReader).backgroundRead(0xc000212090) Mar 29 11:58:38 watson ollama[1408032]: net/http/server.go:690 +0x37 fp=0xc0000617c8 sp=0xc000061780 pc=0x59fdcd46e657 Mar 29 11:58:38 watson ollama[1408032]: net/http.(*connReader).startBackgroundRead.gowrap2() Mar 29 11:58:38 watson ollama[1408032]: net/http/server.go:686 +0x25 fp=0xc0000617e0 sp=0xc0000617c8 pc=0x59fdcd46e585 Mar 29 11:58:38 watson ollama[1408032]: runtime.goexit({}) Mar 29 11:58:38 watson ollama[1408032]: runtime/asm_amd64.s:1700 +0x1 fp=0xc0000617e8 sp=0xc0000617e0 pc=0x59fdcd17f3a1 Mar 29 11:58:38 watson ollama[1408032]: created by net/http.(*connReader).startBackgroundRead in goroutine 10 Mar 29 11:58:38 watson ollama[1408032]: net/http/server.go:686 +0xb6 Mar 29 11:58:38 watson ollama[1408032]: rax 0x7a2d3837e8e0 Mar 29 11:58:38 watson ollama[1408032]: rbx 0x7a2d3837e850 Mar 29 11:58:38 watson ollama[1408032]: rcx 0x3 Mar 29 11:58:38 watson ollama[1408032]: rdx 0x7a2d38521b50 Mar 29 11:58:38 watson ollama[1408032]: rdi 0x0 Mar 29 11:58:38 watson ollama[1408032]: rsi 0x7a2fa0a00030 Mar 29 11:58:38 watson ollama[1408032]: rbp 0x7a2d38521b48 Mar 29 11:58:38 watson ollama[1408032]: rsp 0x7a303b3ffc48 Mar 29 11:58:38 watson ollama[1408032]: r8 0x4 Mar 29 11:58:38 watson ollama[1408032]: r9 0xc000068048 Mar 29 11:58:38 watson ollama[1408032]: r10 0x1 Mar 29 11:58:38 watson ollama[1408032]: r11 0x216 Mar 29 11:58:38 watson ollama[1408032]: r12 0x1 Mar 29 11:58:38 watson ollama[1408032]: r13 0x7a2d38003bc8 Mar 29 11:58:38 watson ollama[1408032]: r14 0xc2f Mar 29 11:58:38 watson ollama[1408032]: r15 0x7a2d3837e850 Mar 29 11:58:38 watson ollama[1408032]: rip 0x59fdcdf550c0 Mar 29 11:58:38 watson ollama[1408032]: rflags 0x10206 Mar 29 11:58:38 watson ollama[1408032]: cs 0x33 Mar 29 11:58:38 watson ollama[1408032]: fs 0x0 Mar 29 11:58:38 watson ollama[1408032]: gs 0x0 Mar 29 11:58:38 watson ollama[1408032]: [GIN] 2025/03/29 - 11:58:38 | 500 | 8.841319466s | 192.168.50.221 | POST "/api/chat" Mar 29 11:58:38 watson ollama[1408032]: time=2025-03-29T11:58:38.730-04:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 2" ``` ### OS Linux ### GPU Nvidia ### CPU Intel ### Ollama version 0.6.3
GiteaMirror added the bug label 2026-04-22 13:31:37 -05:00

@Master-Pr0grammer commented on GitHub (Mar 29, 2025):

This is similar to the bug discussed in https://github.com/ollama/ollama/issues/9699#issuecomment-2763272669, but I feel my issue is more isolated and I can provide more detailed info to help resolve the underlying problem.

@rick-github commented on GitHub (Mar 29, 2025):

Mar 29 11:58:36 watson ollama[1408032]: time=2025-03-29T11:58:36.138-04:00 level=INFO source=server.go:138 msg=offload
 library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split=38,11 memory.available="[10.6 GiB 3.9 GiB]"
 memory.gpu_overhead="0 B" memory.required.full="14.3 GiB" memory.required.partial="14.3 GiB" memory.required.kv="1.9 GiB"
 memory.required.allocations="[10.5 GiB 3.7 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB"
 memory.weights.nonrepeating="787.5 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB"
 projector.weights="795.9 MiB" projector.graph="1.0 GiB"

The context is pre-allocated, but there are temporary allocations during inference. ollama has estimated that it can use [10.5G, 3.7G] of [10.6G, 3.9G], so there's not a lot of room for these allocations (only ~0.1-0.2 GiB of headroom per GPU). Mitigations can be found here: https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288
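
For illustration, here is a minimal sketch of applying the most common mitigations through the Python client's request options (my own example, not from the linked issue; the host and the exact values are placeholders). Lowering num_ctx shrinks the pre-allocated KV cache, and lowering num_gpu keeps more layers on the CPU; both leave VRAM headroom for those temporary allocations:

import ollama

client = ollama.Client(host='http://localhost:11434')  # placeholder host

response = client.chat(
    model='gemma3:12b',
    messages=[{'role': 'user', 'content': 'what is in this image?',
               'images': ['screenshot.png']}],  # placeholder image path
    options={
        'num_ctx': 2048,  # smaller context window -> smaller KV cache
        'num_gpu': 40,    # offload fewer than all 49 layers to the GPUs
    },
)
print(response.message.content)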

@Master-Pr0grammer commented on GitHub (Apr 1, 2025):

@rick-github Thanks for the feedback, but if you take a look at the nvidia-smi txt file I attached above, once the model is fully loaded I still have 4.5 GB remaining on device 0 to allocate, which is more than enough.

The problem is that it is trying to make the allocations on device 1, which only has ~400 MiB remaining.

"ollama has estimated that it can use [10.5G, 3.7G] of [10.6G, 3.9G] so there's not a lot of room for these allocations." - this is my point: it can use [10.5G, 3.7G], but it's not; it's only using half of that, and it's trying to allocate more.

@rick-github commented on GitHub (Apr 1, 2025):

#9791

@Master-Pr0grammer commented on GitHub (Apr 1, 2025):

Ah ok, so this is a known bug/problem then?

Ollama should not be crashing here. It's crashing while allocating 1.1 GB, despite having over 4.5 GB left.

Something is definitely wrong with vision memory allocation.

@Master-Pr0grammer commented on GitHub (Apr 4, 2025):

@rick-github Yeah, there's definitely something wrong with the gemma3 architecture on ollama. I recently tried the new QAT Q4 checkpoints for gemma 3 from Google, and I get the same error even on the 4b model, which only uses 4 GB (on one GPU) out of 16 GB.

"runtime error: invalid memory address or nil pointer dereference" - yup, definitely improper memory management.

It only runs into the issue with image input.

@jessegross commented on GitHub (Apr 19, 2025):

It looks like there are a couple of issues here:

  • There is a mismatch between the old and new engines in where they put the vision graphs. The old engine puts them on the first GPU, whereas the new engine uses the last GPU. The memory estimation logic still follows the old engine, which is why you see Ollama trying to allocate memory on the second GPU while there is free space on the first one.
  • Even in cases where we estimate incorrectly, we shouldn't fail in the middle of inference - if we have to fail, it would be better to do it at startup. This is improved in the upcoming 0.6.6 release, but so far it only covers text, not images.

As @rick-github has mentioned elsewhere, memory estimation is pretty fragile today: there are a lot of factors that affect it, and it's hard to account for all of them by manually updating the estimation logic, leading to both over- and under-estimates. We're working on a new system that will compute this automatically, leading to much more accurate results.
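
To make that fragility concrete, here is a toy sketch of the kind of per-device budget such estimation logic has to predict (purely my own illustration, not ollama's actual code; the numbers are rough figures taken from the offload log above). Every term varies with model architecture, quantization, and backend, so a hand-maintained formula drifts out of date easily:

GiB = 1024 ** 3

def estimate_vram_bytes(weights, kv_cache, graph,
                        projector_weights=0, projector_graph=0):
    # Sum the major allocation classes for one GPU. A real estimator also
    # needs per-backend overheads, buffer alignment, and so on.
    return weights + kv_cache + graph + projector_weights + projector_graph

# Illustrative split for gemma3:12b across two GPUs (rough numbers).
gpu0 = estimate_vram_bytes(weights=int(5.5 * GiB), kv_cache=int(1.5 * GiB),
                           graph=int(1.3 * GiB))
gpu1 = estimate_vram_bytes(weights=int(1.3 * GiB), kv_cache=int(0.4 * GiB),
                           graph=int(1.3 * GiB),
                           projector_weights=int(0.8 * GiB),
                           projector_graph=int(1.0 * GiB))

# If the vision projector is charged to the wrong device (the engine
# mismatch described above), the real usage on one GPU exceeds its estimate
# and cudaMalloc fails at inference time, even though the other GPU is free.
print(f"estimated: gpu0={gpu0 / GiB:.1f} GiB, gpu1={gpu1 / GiB:.1f} GiB")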

@Master-Pr0grammer commented on GitHub (Apr 24, 2025):

Awesome. I just tried the new 0.6.6 release and am still having the same issue with images; however, I noticed some new issues with text-only inference.

With the recent updates I noticed SIGNIFICANT performance increases in gemma models (speed, memory usage, and general model "intelligence" on standard benchmarks), for which I'd like to give a HUGE thanks to the ollama team. However, there are just one or two memory allocation bugs remaining.

I have a stress-testing script I use to determine the max context size I can run before offloading to CPU. Here are some cases where the ollama CUDA memory estimation fails and ollama crashes (with my read on why, from very briefly skimming the logs):

gemma3:4b - works fine, no crashes
gemma3:4b-it-qat - "llama runner process has terminated: exit status 2"
hf.co/stduhpf/google-gemma-3-4b-it-qat-q4_0-gguf-small:latest - "llama runner process has terminated: exit status 2 (status code: 500)" (ollama entered 'panic' mode, GPU VRAM did not recover in time window)

gemma3:12b - "llama runner terminated" error="exit status 2" (ollama tried to incorrectly allocate 100% to gpu 0, and tried to allocate more than was available on gpu 0, it should have split it between the two gpu's in this case)
gemma3:12b-it-qat - "gpu VRAM usage didn't recover within timeout" (under-calculated vram usage on gpu1, estimated 3.8Gb, tried to allocate 4.31Gb, only 3.9 available)
hf.co/stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:latest - "llama runner terminated" error="exit status 2" (similar error to gemma3:12b, I think)

This is not the same specific issue, but it's related to CUDA memory estimation, which seems relevant to the original issue, and these cases might be useful for future improvements/debugging.

When I have time, I can post the actual stress-testing script and the logs, if that would be useful.
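
For reference, a minimal sketch of what such a stress test can look like (my own illustration, not the author's script; the model name, host, and search bounds are placeholders). It binary-searches for the largest num_ctx that completes a request without the runner crashing, which the Python client surfaces as a ResponseError:

import ollama

def fits(client, model, num_ctx):
    # A runner crash or OOM comes back to the client as a ResponseError (500).
    try:
        client.chat(model=model,
                    messages=[{'role': 'user', 'content': 'ping'}],
                    options={'num_ctx': num_ctx})
        return True
    except ollama.ResponseError:
        return False

def max_context(client, model, lo=2048, hi=131072):
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(client, model, mid):
            best, lo = mid, mid + 1  # fits; try a larger context
        else:
            hi = mid - 1             # crashed/OOM; try a smaller context
    return best

client = ollama.Client(host='http://localhost:11434')  # placeholder host
print(max_context(client, 'gemma3:12b'))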

@alllexx88 commented on GitHub (May 22, 2025):

I can confirm a similar issue when trying to process an image with 0.7.0, an Nvidia GPU, and the llama3.2-vision model.

@rick-github commented on GitHub (May 22, 2025):

0.7.1 has a fix for a llama3.2-vision crash.

@alllexx88 commented on GitHub (May 22, 2025):

@rick-github Thank you, something has really changed. The error message is different now (with 0.7.1-rc0):

ollama._types.ResponseError: llama runner process has terminated: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 3070038528 (status code: 500)

Also, this happens with 2 different Nvidia GPUs. If I pass CUDA_VISIBLE_DEVICES=0, it works, though the model doesn't fully fit in VRAM; maybe that's why it works.

UPD: the logs when it crashes:

May 22 15:53:08 AlexArch ollama[302713]: time=2025-05-22T15:53:08.445+02:00 level=WARN source=sched.go:140 msg="mllama does not currently support parallel requests"
May 22 15:53:08 AlexArch ollama[302713]: time=2025-05-22T15:53:08.747+02:00 level=INFO source=sched.go:793 msg="new model will fit in available VRAM, loading" model=/var/lib/ollama/blobs/sha256-7633fdffe14c0f7acc115402376be5bd6052220c348676c5133dc011b35e2429 library=cuda parallel=1 required="13.1 GiB"
May 22 15:53:08 AlexArch ollama[302713]: [GIN] 2025/05/22 - 15:53:08 | 200 |      30.366µs |       127.0.0.1 | HEAD     "/"
May 22 15:53:08 AlexArch ollama[302713]: [GIN] 2025/05/22 - 15:53:08 | 200 |      17.077µs |       127.0.0.1 | GET      "/api/ps"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.095+02:00 level=INFO source=server.go:135 msg="system memory" total="93.8 GiB" free="78.3 GiB" free_swap="56.0 KiB"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.096+02:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[11.4 GiB 5.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="13.1 GiB" memory.required.partial="13.1 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[9.0 GiB 4.1 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="669.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.7 GiB" projector.graph="2.8 GiB"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.123+02:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /var/lib/ollama/blobs/sha256-7633fdffe14c0f7acc115402376be5bd6052220c348676c5133dc011b35e2429 --ctx-size 4096 --batch-size 512 --n-gpu-layers 41 --threads 6 --parallel 1 --tensor-split 21,20 --port 40805"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.123+02:00 level=INFO source=sched.go:472 msg="loaded runners" count=1
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.123+02:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.123+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.133+02:00 level=INFO source=runner.go:917 msg="starting ollama engine"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.133+02:00 level=INFO source=runner.go:977 msg="Server listening on 127.0.0.1:40805"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.155+02:00 level=INFO source=ggml.go:87 msg="" architecture=mllama file_type=Q4_K_M name="" description="" num_tensors=908 num_key_values=39
May 22 15:53:09 AlexArch ollama[302713]: load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
May 22 15:53:09 AlexArch ollama[302713]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
May 22 15:53:09 AlexArch ollama[302713]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 22 15:53:09 AlexArch ollama[302713]: ggml_cuda_init: found 2 CUDA devices:
May 22 15:53:09 AlexArch ollama[302713]:   Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
May 22 15:53:09 AlexArch ollama[302713]:   Device 1: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
May 22 15:53:09 AlexArch ollama[302713]: load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.318+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.375+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
May 22 15:53:09 AlexArch ollama[302713]: [GIN] 2025/05/22 - 15:53:09 | 200 |      29.468µs |       127.0.0.1 | HEAD     "/"
May 22 15:53:09 AlexArch ollama[302713]: [GIN] 2025/05/22 - 15:53:09 | 200 |      32.865µs |       127.0.0.1 | GET      "/api/ps"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.392+02:00 level=INFO source=ggml.go:313 msg="model weights" buffer=CPU size="285.7 MiB"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.392+02:00 level=INFO source=ggml.go:313 msg="model weights" buffer=CUDA1 size="4.6 GiB"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.392+02:00 level=INFO source=ggml.go:313 msg="model weights" buffer=CUDA0 size="2.6 GiB"
May 22 15:53:09 AlexArch ollama[302713]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2927.82 MiB on device 1: cudaMalloc failed: out of memory
May 22 15:53:09 AlexArch ollama[302713]: ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 3070038528
May 22 15:53:09 AlexArch ollama[302713]: panic: failed to reserve graph
May 22 15:53:09 AlexArch ollama[302713]: goroutine 5 [running]:
May 22 15:53:09 AlexArch ollama[302713]: github.com/ollama/ollama/runner/ollamarunner.(*Server).loadModel(0xc0001a67e0, {0x58afbcaf5210, 0xc0005ab630}, {0x7ffe7a056c2c?, 0x0?}, {0x6, 0x0, 0x29, {0xc00059efa8, 0x2, ...}, ...}, ...)
May 22 15:53:09 AlexArch ollama[302713]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:874 +0x33d
May 22 15:53:09 AlexArch ollama[302713]: created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
May 22 15:53:09 AlexArch ollama[302713]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:950 +0x9c7
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.626+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.683+02:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
May 22 15:53:09 AlexArch ollama[302713]: time=2025-05-22T15:53:09.877+02:00 level=ERROR source=sched.go:478 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory\nggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 3070038528"

P.S. I thought this was the same as the current issue, so I posted here. If it's not, I'll be glad to open a new one, thanks
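
For anyone wanting to apply the CUDA_VISIBLE_DEVICES=0 workaround mentioned above to a systemd-managed install like the one in these logs, the usual approach (a generic sketch, not part of the original report) is a drop-in override via sudo systemctl edit ollama.service, followed by a service restart:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0"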

@rick-github commented on GitHub (May 22, 2025):

Your problem looks like a general OOM issue rather than the specific gemma3 problem from this issue. Ways to mitigate out-of-memory failures can be found here: https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288. Work is ongoing to improve the memory estimation logic to reduce the frequency of OOMs. If the mitigations don't help, feel free to open a new ticket.

@jessegross commented on GitHub (May 22, 2025):

I think the original issues in this bug have been fixed. From https://github.com/ollama/ollama/issues/10041#issuecomment-2816399723:

  • Mismatch between vision projector location in the Ollama engine vs. the estimate: fixed in https://github.com/ollama/ollama/commit/d75557747357bfb3afd441a0cc207ec944bd3a18
  • Failing in the middle of inference on multimodal models when processing the first image: fixed in https://github.com/ollama/ollama/commit/fe623c2cf44e672dde4552985d9f758a9d09605d

There are still general issues with estimates being incorrect but I'm going to close this one out.

@Master-Pr0grammer commented on GitHub (May 26, 2025):

@jessegross OK, I just tried out 0.7.1, and I am still having the same issue; however, now it crashes at text inference as well.

It might be a different issue given all of the recent fixes, but it definitely seems like something is going wrong here.

When I run the default gemma3:12b, stock settings, default context length (same system: 16 GB VRAM + 16 GB RAM):

May 25 23:54:00 watson ollama[2409907]: [GIN] 2025/05/25 - 23:54:00 | 200 |      59.836µs |       127.0.0.1 | GET      "/api/version"
May 25 23:54:08 watson ollama[2409907]: time=2025-05-25T23:54:08.698-04:00 level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de library=cuda parallel=2 required="13.6 GiB"
May 25 23:54:08 watson ollama[2409907]: time=2025-05-25T23:54:08.873-04:00 level=INFO source=server.go:135 msg="system memory" total="15.6 GiB" free="14.0 GiB" free_swap="2.1 GiB"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.058-04:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split=37,12 memory.available="[10.6 GiB 3.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="13.6 GiB" memory.required.partial="13.6 GiB" memory.required.kv="1.3 GiB" memory.required.allocations="[9.8 GiB 3.8 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.058-04:00 level=INFO source=server.go:211 msg="enabling flash attention"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.058-04:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.152-04:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --threads 4 --flash-attn --parallel 2 --tensor-split 37,12 --port 46521"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.152-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.153-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.153-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.165-04:00 level=INFO source=runner.go:925 msg="starting ollama engine"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.165-04:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:46521"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.256-04:00 level=INFO source=ggml.go:92 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=37
May 25 23:54:09 watson ollama[2409907]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
May 25 23:54:09 watson ollama[2409907]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
May 25 23:54:09 watson ollama[2409907]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 25 23:54:09 watson ollama[2409907]: ggml_cuda_init: found 2 CUDA devices:
May 25 23:54:09 watson ollama[2409907]:   Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
May 25 23:54:09 watson ollama[2409907]:   Device 1: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
May 25 23:54:09 watson ollama[2409907]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.392-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.403-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.463-04:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="3.0 GiB"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.463-04:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="787.5 MiB"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.463-04:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="4.6 GiB"
May 25 23:54:09 watson ollama[2409907]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1159.25 MiB on device 1: cudaMalloc failed: out of memory
May 25 23:54:09 watson ollama[2409907]: ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1215561728
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.796-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.796-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="1.1 GiB"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.796-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
May 25 23:54:09 watson ollama[2409907]: panic: insufficient memory - required allocations: {InputWeights:825753600A CPU:{Name:CPU Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} GPUs:[{Name:CUDA0 Weights:[143373568A 143373568A 143373568A 143373568A 143373568A 143373568A 128167168A 128167168A 143373568A 126139648A 126139648A 143373568A 126139648A 126139648A 143373568A 126139648A 126139648A 143373568A 126139648A 126139648A 143373568A 126139648A 126139648A 143373568A 126139648A 126139648A 143373568A 126139648A 126139648A 143373568A 126139648A 126139648A 143373568A 126139648A 126139648A 143373568A 126139648A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} {Name:CUDA1 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 126139648A 143373568A 126139648A 126139648A 143373568A 141346048A 141346048A 143373568A 141346048A 141346048A 143373568A 1669154176A] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:1215561728F}]}
May 25 23:54:09 watson ollama[2409907]: goroutine 8 [running]:
May 25 23:54:09 watson ollama[2409907]: github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0xc001e9c0c0)
May 25 23:54:09 watson ollama[2409907]:         github.com/ollama/ollama/ml/backend/ggml/ggml.go:643 +0x756
May 25 23:54:09 watson ollama[2409907]: github.com/ollama/ollama/runner/ollamarunner.multimodalStore.getTensor(0xc00016d9f8?, {0x62f55b23c8f0, 0xc0004b43f0}, {0x62f55b240928, 0xc001e9db40}, {0x62f55b24a968, 0xc000509f50}, 0x1)
May 25 23:54:09 watson ollama[2409907]:         github.com/ollama/ollama/runner/ollamarunner/multimodal.go:98 +0x2a4
May 25 23:54:09 watson ollama[2409907]: github.com/ollama/ollama/runner/ollamarunner.multimodalStore.getMultimodal(0xc00016dcd8, {0x62f55b23c8f0, 0xc0004b43f0}, {0x62f55b240928, 0xc001e9db40}, {0xc000482020, 0x1, 0x30?}, 0x1)
May 25 23:54:09 watson ollama[2409907]:         github.com/ollama/ollama/runner/ollamarunner/multimodal.go:56 +0xe5
May 25 23:54:09 watson ollama[2409907]: github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc00012d8c0)
May 25 23:54:09 watson ollama[2409907]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:796 +0x70e
May 25 23:54:09 watson ollama[2409907]: github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc00012d8c0, {0x7ffcb9ca2bda?, 0x0?}, {0x4, 0x0, 0x31, {0xc00046e878, 0x2, 0x2}, 0x1}, ...)
May 25 23:54:09 watson ollama[2409907]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270
May 25 23:54:09 watson ollama[2409907]: github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc00012d8c0, {0x62f55b238a90, 0xc00013de50}, {0x7ffcb9ca2bda?, 0x0?}, {0x4, 0x0, 0x31, {0xc00046e878, 0x2, ...}, ...}, ...)
May 25 23:54:09 watson ollama[2409907]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8
May 25 23:54:09 watson ollama[2409907]: created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
May 25 23:54:09 watson ollama[2409907]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.875-04:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
May 25 23:54:09 watson ollama[2409907]: time=2025-05-25T23:54:09.905-04:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory\nggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1215561728"
May 25 23:54:09 watson ollama[2409907]: [GIN] 2025/05/25 - 23:54:09 | 500 |  2.510247952s |       127.0.0.1 | POST     "/api/chat"
May 25 23:54:14 watson ollama[2409907]: time=2025-05-25T23:54:14.922-04:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.017510731 runner.size="13.6 GiB" runner.vram="13.6 GiB" runner.parallel=2 runner.pid=2363497 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de
May 25 23:54:15 watson ollama[2409907]: time=2025-05-25T23:54:15.173-04:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.267841679 runner.size="13.6 GiB" runner.vram="13.6 GiB" runner.parallel=2 runner.pid=2363497 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de
May 25 23:54:15 watson ollama[2409907]: time=2025-05-25T23:54:15.422-04:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5172213249999995 runner.size="13.6 GiB" runner.vram="13.6 GiB" runner.parallel=2 runner.pid=2363497 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de
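Reading the panic: `reserveWorstCaseGraph` pre-reserves the largest compute graph a session could need (including a worst-case image batch), and the 1.1 GiB graph buffer it assigns to CUDA1 (`Graph:1215561728F`) simply does not fit on the 1050 Ti. Until the scheduler accounts for that, the usual mitigations are to shrink the context window or keep the model off the small card. A minimal sketch using the standard `num_ctx` option; 4096 is an illustrative value, not a tested threshold:

```python
import ollama

client = ollama.Client(host='http://192.168.50.221:11434/')

# Sketch of a mitigation, not a fix: a smaller context shrinks the
# worst-case graph the runner reserves at load time, which is what
# overflows the 1050 Ti in the log above. 4096 is illustrative.
response = client.chat(
    model='gemma3:12b',
    messages=[
        {'role': 'user',
         'content': 'what is in this image?',
         'images': ['example.png']},  # any local image path
    ],
    options={'num_ctx': 4096},
)
print(response.message.content)
```

Alternatively, hiding the small card from the server process with `CUDA_VISIBLE_DEVICES=0` avoids the split entirely, at the cost of more CPU offload. Both are workarounds for the symptom, not fixes for the memory estimate itself.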

@jessegross commented on GitHub (Jun 16, 2025):

There is an early preview of Ollama's new memory management that aims to fix these issues comprehensively. It is still in development; however, if you want to compile from source and try it out, you can find it here: https://github.com/ollama/ollama/pull/11090

Please leave any feedback on that PR.
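Until that rework lands, callers can at least fail soft on this failure mode. A hypothetical client-side pattern (the `num_ctx` fallback ladder is illustrative, and this only masks the load-time failure rather than fixing the estimate):

```python
import ollama

def chat_with_fallback(client: ollama.Client, **kwargs):
    """Retry a chat request with progressively smaller context windows.

    A client-side mitigation only: it papers over load-time allocation
    failures like the panic above; it does not fix the scheduler's
    memory estimate.
    """
    for num_ctx in (8192, 4096, 2048):  # illustrative fallback ladder
        try:
            return client.chat(options={'num_ctx': num_ctx}, **kwargs)
        except ollama.ResponseError as err:
            # The crashes in this issue surface as HTTP 500s;
            # re-raise anything else.
            if err.status_code != 500:
                raise
    raise RuntimeError('model failed to load even at the smallest context')

client = ollama.Client(host='http://192.168.50.221:11434/')
response = chat_with_fallback(
    client,
    model='gemma3:12b',
    messages=[{'role': 'user', 'content': 'hello'}],
)
print(response.message.content)
```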
