[GH-ISSUE #14241] ibm/granite4:1b #55786

Closed
opened 2026-04-29 09:44:01 -05:00 by GiteaMirror · 11 comments

Originally created by @Weilando on GitHub (Feb 13, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14241

What is the issue?

After sending a message to the ibm/granite4:1b model, I receive an internal server error:

> Error: 500 Internal Server Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details

This error occurs when using the CLI, the GUI, and the Python package. I have tried various quantizations (e.g., q4_K_M and q2_K) and have already reinstalled Ollama and the model. Other models from the ibm/granite4 family work fine, e.g., ibm/granite4:micro-q4_K_S.
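For reference, a minimal sketch of the Python-package path, assuming the official `ollama` client (`pip install ollama`) and the locally running server from the logs below; the exception handler is where the 500 surfaces:

```python
# Minimal reproduction sketch (assumes the `ollama` Python client and a local
# Ollama server); the CLI and GUI paths should fail the same way.
import ollama

try:
    resp = ollama.chat(
        model="ibm/granite4:1b",
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp["message"]["content"])
except ollama.ResponseError as e:
    # On the affected setup, this is where the 500 "model runner has
    # unexpectedly stopped" error shows up.
    print(e.status_code, e.error)
```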

It seems like the error occurs before any useful log messages appear, but I provide the server logs anyway. I am not sure whether it is an Ollama-related problem or whether it arises from llama.cpp or the model itself. Any suggestions are appreciated!

Relevant log output

```shell
time=2026-02-13T21:56:31.952+01:00 level=INFO source=routes.go:1636 msg="server config" env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:8192 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/.../.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false http_proxy: https_proxy: no_proxy:]"
time=2026-02-13T21:56:31.953+01:00 level=INFO source=images.go:473 msg="total blobs: 0"
time=2026-02-13T21:56:31.953+01:00 level=INFO source=images.go:480 msg="total unused blobs removed: 0"
time=2026-02-13T21:56:31.953+01:00 level=INFO source=routes.go:1689 msg="Listening on [::]:11434 (version 0.16.1)"
time=2026-02-13T21:56:31.954+01:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-02-13T21:56:31.955+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --port 51440"
time=2026-02-13T21:56:32.092+01:00 level=INFO source=types.go:42 msg="inference compute" id=0 filter_id=0 library=Metal compute=0.0 name=Metal description="Apple M1" libdirs="" driver=0.0 pci_id="" type=discrete total="11.8 GiB" available="11.8 GiB"
time=2026-02-13T21:56:32.092+01:00 level=INFO source=routes.go:1739 msg="vram-based default context" total_vram="11.8 GiB" default_num_ctx=4096
```

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.16.1

GiteaMirror added the bug label 2026-04-29 09:44:01 -05:00

@rick-github commented on GitHub (Feb 13, 2026):

There should be more log than this; at the very least there should be a GIN log line showing the HTTP failure.


@rick-github commented on GitHub (Feb 13, 2026):

```console
$ ollama run ibm/granite4:1b
>>> hello
Hello! How may I assist you today?
```

@Weilando commented on GitHub (Feb 14, 2026):

Thanks for the quick reply! Unfortunately, none of my server logs contain more information than the one provided. Is there any hidden setting or configuration that disables certain log levels?

However, the tag "ibm/granite4:1b" from your example works. I conclude from its size of 3.3 GB that it resolves to "ibm/granite4:1b-f16", i.e., the unquantized version. Could you please try one of the quantized models on your system? Maybe there is a hiccup with the quantization.
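For what it's worth, one way to confirm which build a tag actually resolves to is the client's show call; a small sketch, assuming the `ollama` Python client and the field names exposed by recent releases:

```python
# Hedged sketch: print parameter size and quantization level for the tags
# discussed above. Field names ("details", "parameter_size",
# "quantization_level") follow recent ollama Python client releases and may
# differ in older versions.
import ollama

for tag in ("ibm/granite4:1b", "ibm/granite4:1b-q2_K", "ibm/granite4:1b-q4_K_M"):
    details = ollama.show(tag)["details"]
    print(tag, details["parameter_size"], details["quantization_level"])
```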


@rick-github commented on GitHub (Feb 15, 2026):

```console
$ for i in ibm/granite4:1b-{q2_K,q4_K_M} ; do echo $i $(ollama run $i hello) ; done
ibm/granite4:1b-q2_K Hello! How can I assist you today?
ibm/granite4:1b-q4_K_M Hello! How can I assist you today? If you have any questions or need information on a particular topic, feel free to ask.
```

There are no hidden settings that would disable GIN entries in the log, or information about the loading, unloading and failure of models. The log you have posted is from the ollama server starting, discovering GPUs, and then waiting for instructions on what model to load. If you then run the granite model and it fails, there should be information in the log about the failure.


@Weilando commented on GitHub (Feb 15, 2026):

Thanks for trying some quantized models! After uninstalling Ollama again and also cleaning remnants in the Application Support folder, I get some helpful logs.

When running "ibm/granite4:1b-q2_K", it seems like the model server starts successfully, but then an assertion (`found`) fails in the function `llama_sampler_dist_apply` in `llama-sampling.cpp`. Here is the shortened stack trace (skipping a lot of goroutines, but I can provide the full stack trace if required). Might this be a bug, or is there something missing on my machine? I have installed Ollama via `curl -fsSL https://ollama.com/install.sh | sh` and some other models work fine.

```
...
time=2026-02-15T23:06:49.675+01:00 level=INFO source=server.go:1388 msg="llama runner started in 0.38 seconds"
[GIN] 2026/02/15 - 23:06:49 | 200 |    611.8225ms |       127.0.0.1 | POST     "/api/generate"
Assertion failed: (found), function llama_sampler_dist_apply, file llama-sampling.cpp, line 660.
SIGABRT: abort
PC=0x19c6615b0 m=7 sigcode=0
signal arrived during cgo execution

goroutine 36 gp=0x14000102a80 m=7 mp=0x14000500008 [syscall]:
runtime.cgocall(0x102eab5d4, 0x14000083c48)
	/Users/runner/hostedtoolcache/go/1.24.1/arm64/src/runtime/cgocall.go:167 +0x44 fp=0x14000083c00 sp=0x14000083bc0 pc=0x1021c35c4
github.com/ollama/ollama/llama._Cfunc_common_sampler_csample(0x104a616c0, 0x96aaac000, 0x22)
	_cgo_gotypes.go:425 +0x34 fp=0x14000083c40 sp=0x14000083c00 pc=0x1025de264
github.com/ollama/ollama/runner/llamarunner.(*Server).processBatch.(*SamplingContext).Sample.func1(...)
	/Users/runner/work/ollama/ollama/llama/llama.go:679
github.com/ollama/ollama/llama.(*SamplingContext).Sample(...)
	/Users/runner/work/ollama/ollama/llama/llama.go:679
github.com/ollama/ollama/runner/llamarunner.(*Server).processBatch(0x14000322140, 0x140000e1950, 0x14000083f18)
	/Users/runner/work/ollama/ollama/runner/llamarunner/runner.go:539 +0x510 fp=0x14000083ed0 sp=0x14000083c40 pc=0x10267b510
github.com/ollama/ollama/runner/llamarunner.(*Server).run(0x14000322140, {0x10372a800, 0x14000160b90})
	/Users/runner/work/ollama/ollama/runner/llamarunner/runner.go:387 +0x164 fp=0x14000083fa0 sp=0x14000083ed0 pc=0x10267ae94
github.com/ollama/ollama/runner/llamarunner.Execute.gowrap1()
	/Users/runner/work/ollama/ollama/runner/llamarunner/runner.go:981 +0x30 fp=0x14000083fd0 sp=0x14000083fa0 pc=0x10267f3a0
runtime.goexit({})
	/Users/runner/hostedtoolcache/go/1.24.1/arm64/src/runtime/asm_arm64.s:1223 +0x4 fp=0x14000083fd0 sp=0x14000083fd0 pc=0x1021ceed4
created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
	/Users/runner/work/ollama/ollama/runner/llamarunner/runner.go:981 +0x44c

...

time=2026-02-15T23:06:51.681+01:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:50580/completion\": EOF"
[GIN] 2026/02/15 - 23:06:51 | 500 |  297.182083ms |       127.0.0.1 | POST     "/api/chat"
```

@rick-github commented on GitHub (Feb 17, 2026):

The error indicates that a forward pass didn't generate a token with a high enough probability. What was the prompt?
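To illustrate the kind of check that is failing (an analogue only, not the actual llama.cpp code): the distribution sampler softmaxes the logits, draws a uniform random value, and walks the cumulative probabilities; the `found` assertion fires when no token covers the draw, which is what happens if the probabilities come out degenerate (e.g., NaN from a broken quantization).

```python
# Illustrative analogue only, NOT the llama.cpp implementation: sample a token
# from softmaxed logits by walking the cumulative distribution, and assert that
# some token covered the random draw. Degenerate logits (e.g. NaN) trip the assert.
import math
import random

def sample_dist(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [x / total for x in exps]

    r = random.random()
    cum, found, chosen = 0.0, False, -1
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            found, chosen = True, i
            break
    assert found, "no token covered the draw"  # analogous to the crash above
    return chosen

print(sample_dist([2.0, 0.5, -1.0]))    # well-formed logits: returns an index
print(sample_dist([float("nan")] * 3))  # degenerate logits: assertion fails
```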


@Weilando commented on GitHub (Feb 17, 2026):

I have used some simple prompts like `Write a short joke.` and also `hello` (which you used in your call above). Longer prompts from a project also lead to the internal server error.


@rick-github commented on GitHub (Feb 18, 2026):

Might be a Mac thing. I've tried the models on AMD, Nvidia, and plain CPU without any problems. The quants are not listed on the official [granite4](https://ollama.com/library/granite4/tags) page; perhaps they were unreliable.


@Weilando commented on GitHub (Feb 22, 2026):

Thanks for your efforts anyway! Is there any action we should take from here (e.g., raising an issue elsewhere or documenting the potential problems with the quants)?


@rick-github commented on GitHub (Feb 22, 2026):

Unfortunately, there's no mechanism for giving feedback on user-uploaded models. You could try filing an issue in their GH repo, https://github.com/ibm-granite/granite-4.0-language-models.


@Weilando commented on GitHub (Feb 22, 2026):

Ok, I am going to do so. Thanks!
