[GH-ISSUE #11921] Ollama 0.11.5-RC2: Linux Nvidia / Cuda: Error: 500 Internal Server Error when OLLAMA_NEW_ESTIMATES=1 is not set! #54426

Closed
opened 2026-04-29 05:55:45 -05:00 by GiteaMirror · 3 comments

Originally created by @dan-and on GitHub (Aug 15, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11921

What is the issue?

I run a multi-GPU server with CUDA 13 on Ubuntu 22.04 LTS.

After upgrading from 0.11.4 to 0.11.5-RC2, I get a 500 Internal Server Error with the default configuration.

After activating OLLAMA_NEW_ESTIMATES=1 (the new memory management added by @jessegross in #11090),
it works fine.
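
To test the workaround without touching the service configuration, the variable can be set for a single foreground run. A minimal sketch, assuming the stock Linux install with its ollama systemd service; the model is the same one used in the failing command below:

```shell
# Stop the system service so a manually started server owns the port
sudo systemctl stop ollama

# Start the server with the new memory estimates enabled for this run only
OLLAMA_NEW_ESTIMATES=1 OLLAMA_NEW_ENGINE=1 ollama serve &

# In another shell, retry the prompt that previously failed
ollama run qwen3:30b --verbose "tell me a story"
```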

So something in the default memory management path broke along the way.

ollama.log: https://github.com/user-attachments/files/21797759/ollama.log

Environment variables:

Environment="OLLAMA_CONTEXT_LENGTH=4096"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_SCHED_SPREAD=0"
Environment="OLLAMA_NEW_ENGINE=1"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_LOAD_TIMEOUT=15m0s"
Environment="OLLAMA_DEBUG=1"

Relevant log output

$ ollama run qwen3:30b --verbose "tell me a story" 
Error: 500 Internal Server Error: llama runner process has terminated: exit status 2

Logs (see the attached ollama.log for more):
Aug 15 13:55:21 gpu ollama[6087]: load_tensors: tensor 'token_embd.weight' (q4_K) (and 69 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
Aug 15 13:55:22 gpu ollama[6087]: load_tensors: offloading 48 repeating layers to GPU
Aug 15 13:55:22 gpu ollama[6087]: load_tensors: offloading output layer to GPU
Aug 15 13:55:22 gpu ollama[6087]: load_tensors: offloaded 49/49 layers to GPU
Aug 15 13:55:22 gpu ollama[6087]: load_tensors:        CUDA0 model buffer size =   134.73 MiB
Aug 15 13:55:22 gpu ollama[6087]: load_tensors:        CUDA1 model buffer size =   368.06 MiB
Aug 15 13:55:22 gpu ollama[6087]: load_tensors:        CUDA0 model buffer size =  4754.91 MiB
Aug 15 13:55:22 gpu ollama[6087]: load_tensors:        CUDA1 model buffer size =  4220.73 MiB
Aug 15 13:55:22 gpu ollama[6087]: load_tensors:   CPU_Mapped model buffer size =  8705.57 MiB
Aug 15 13:55:22 gpu ollama[6087]: SIGSEGV: segmentation violation
Aug 15 13:55:22 gpu ollama[6087]: PC=0x0 m=5 sigcode=1 addr=0x0
Aug 15 13:55:22 gpu ollama[6087]: signal arrived during cgo execution
Aug 15 13:55:22 gpu ollama[6087]: goroutine 23 gp=0xc000102a80 m=5 mp=0xc000100008 [syscall]:

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.11.5-rc2

GiteaMirror added the bug label 2026-04-29 05:55:45 -05:00

@jessegross commented on GitHub (Aug 15, 2025):

If possible, can you post the following logs:

  • 0.11.4 with it working using the same settings you have here
  • 0.11.5-rc2 with OLLAMA_NEW_ENGINE set
  • 0.11.5-rc2 with OLLAMA_NEW_ENGINE and OLLAMA_NEW_ESTIMATES set
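
One way to capture each of those logs, sketched here under the assumption that the service runs under systemd and logs to the journal (the output file name is arbitrary): apply the configuration under test, restart the service, then:

```shell
# Mark a start time, reproduce the failure, then save that journal slice
START=$(date '+%Y-%m-%d %H:%M:%S')
ollama run qwen3:30b --verbose "tell me a story"
journalctl -u ollama --since "$START" > ollama-0.11.5-rc2-new-engine.log
```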

@jessegross commented on GitHub (Aug 15, 2025):

Actually, it looks like you have old and new versions installed on top of each other. You'll need to remove the old version and install fresh. This is the same as https://github.com/ollama/ollama/issues/11211

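For reference, a clean reinstall on Linux could look roughly like this. This is a sketch assuming the standard install-script layout under /usr/local; the point is to remove the stale runner libraries the old version left behind before installing fresh:

```shell
# Stop the service before touching files
sudo systemctl stop ollama

# Remove the old binary and its bundled libraries
sudo rm -f /usr/local/bin/ollama
sudo rm -rf /usr/local/lib/ollama

# Reinstall with the official script
curl -fsSL https://ollama.com/install.sh | sh
```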

@dan-and commented on GitHub (Aug 15, 2025):

I have gone through all the variants, and you are right (again):
regardless of 0.11.4 or 0.11.5-rc2, I need OLLAMA_NEW_ENGINE set to 1 for it to work as intended. It is the same issue as #11211, so I will close this issue.

Reference: github-starred/ollama#54426