[GH-ISSUE #7441] Error: unknown error was encountered while running the model GGML_ASSERT(i01 >= 0 && i01 < ne01) failed #30490

Closed
opened 2026-04-22 10:08:45 -05:00 by GiteaMirror · 25 comments

Originally created by @kalcao on GitHub (Oct 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7441

What is the issue?

[Nanollava](https://ollama.com/qnguyen3/nanollava) returns a `GGML_ASSERT(i01 >= 0 && i01 < ne01) failed` error on chat with an image

Output:

ubuntu@ubuntu:~/workspace$ ollama run qnguyen3/nanollava "tell me what do you see in this picture? ./sample.jpg"
Added image './sample.jpg'
Error: an unknown error was encountered while running the model GGML_ASSERT(i01 >= 0 && i01 < ne01) failed

OS

Linux

GPU

Other

CPU

AMD

Ollama version

0.3.14

GiteaMirror added the bug label 2026-04-22 10:08:45 -05:00

@rick-github commented on GitHub (Oct 31, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging. It would also be helpful to have a copy of the image that caused the failure.

$ ollama run qnguyen3/nanollava "tell me what do you see in this picture? ./puppy.jpg"
Added image './puppy.jpg'
This image showcases a domestic scene of a small white puppy with black eyes, standing on a concrete
ledge. The puppy appears  to be crouching down, looking over the edge of the ledge. The environment
seems to have a stone floor or steps nearby, and there's a cat resting near the edge of a post. A red collar
around the puppy's neck is also visible.
$ ollama -v
ollama version is 0.3.14

@kalcao commented on GitHub (Nov 1, 2024):

I have uploaded the logs to https://hastebin.skyra.pw/edinabafos.prolog
The Python code I used for it is:

import ollama

res = ollama.chat(
	model="qnguyen3/nanollava:latest",
	messages=[
		{
			'role': 'user',
			'content': 'Describe this image:',
			'images': ['./image.png']
		}
	]
)

print(res['message']['content'])

![image](https://github.com/user-attachments/assets/73832b86-d277-47fc-b47e-4b4a7da4817a)
This is the picture I tried. I also tried other pictures but still get the same error.


@rick-github commented on GitHub (Nov 1, 2024):

This affects CPU-based runners (cpu, cpu_avx, cpu_avx2) from 0.3.14 onwards. Earlier versions work fine, as do CUDA-based runners in all versions through 0.4.0-rc6. ROCm and Metal are untested.

$ curl localhost:11434/api/version
{"version":"0.3.13"}
$ (echo '{"model":"qnguyen3/nanollava","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq
{
  "model": "qnguyen3/nanollava",
  "created_at": "2024-11-01T22:29:39.812672017Z",
  "message": {
    "role": "assistant",
    "content": "An outdoor scene of a river with blue water."
  },
  "done_reason": "stop",
  "done": true,
  "total_duration": 4992363967,
  "load_duration": 1942686202,
  "prompt_eval_duration": 2707665000,
  "eval_count": 11,
  "eval_duration": 272003000
}
$ curl localhost:11434/api/version
{"version":"0.3.14"}
$ (echo '{"model":"qnguyen3/nanollava","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq
{
  "error": "an unknown error was encountered while running the model GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
}
$ curl localhost:11434/api/version
{"version":"0.4.0-rc6"}
$ (echo '{"model":"qnguyen3/nanollava","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq
{
  "error": "POST predict: Post \"http://127.0.0.1:32925/completion\": EOF"
}

Other llava-based models appear unaffected:

$ curl localhost:11434/api/version
{"version":"0.3.14"}
$ (echo '{"model":"llava:7b-v1.5-q4_K_M","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq .message.content
" The image captures a serene scene of a large lake, surrounded by trees and hills. A beautiful mountain stream flows into the lake from a higher elevation. The tranquil water reflects the landscape, creating an idyllic setting for relaxation or leisure activities.\n\nA small plant is visible near the water's edge, adding to the natural beauty of the scene. In the background, there are several cars parked at various distances from each other, potentially belonging to visitors enjoying the view and tranquility that this picturesque landscape offers."
$ (echo '{"model":"llava-llama3:8b-v1.1-q4_0","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq .message.content
"The image captures a serene scene of a lake at sunset. The vantage point is from the shore, looking out towards the calm expanse of water that stretches out to a distant shore adorned with trees and hills. The sky above is painted in hues of blue, gradually transitioning into a warm orange as it meets the horizon. The sun is partially obscured by the hillside in the distance, casting long shadows across the lake's surface.\n\nIn the foreground on the right side of the image, there are small plants peeking out from the ground, adding a touch of green to the scene. On the left side of the image, there's a small boat moored near the shore, perhaps indicating that this is a place where people come to enjoy the tranquility of the lake.\n\nThe overall atmosphere conveyed by the image is one of peace and quietude, as if time itself has slowed down in this particular corner of the world. The precise location of each object - from the boat on the left to the plants on the right, and the hills beyond the water - adds depth to the image, creating a sense of distance and scale.\n\nThere's no text visible in the image, reinforcing the impression that this is a place untouched by modernity or technology. The relative positions of the objects suggest a well-balanced composition, with each element contributing to the overall harmony of the scene. The image doesn't just show what can be seen; it tells a story of a quiet moment frozen in time, a snapshot of nature's beauty undisturbed."

The assert happens at this line: https://github.com/ollama/ollama/blob/8a9bb0d000ae8201445ef1a590d7136df0a16f8b/llama/ggml.c#L13425
I believe this is a result of https://github.com/ollama/ollama/commit/f2890a4494f9fb3722ee7a4c506252362d1eab65, which added support for the granite models and, as a side effect, bumped to a new version of llama.cpp.

This probably needs to be logged with [llama.cpp](https://github.com/ggerganov/llama.cpp/issues). In the meantime, you can either roll back to 0.3.13, try a different model, or get a GPU.
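For reference, a minimal reproduction sketch using the Python client (assumptions: a local `./image.png` exists, and `num_gpu: 0` is enough to force the CPU runner, as in the curl examples above):

```py
import ollama

# Force the CPU runner with num_gpu=0; on 0.3.14+ this should trip the assert.
res = ollama.chat(
    model="qnguyen3/nanollava",
    messages=[{
        "role": "user",
        "content": "Describe this image:",
        "images": ["./image.png"],  # placeholder path
    }],
    options={"num_gpu": 0},
)
print(res["message"]["content"])
```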


@kalcao commented on GitHub (Nov 2, 2024):

I installed 0.3.13 and it worked fine. Thank you very much for the help!


@jessegross commented on GitHub (Nov 4, 2024):

We should keep this open to track the issue, even if the fix is ultimately in llama.cpp.


@ccreutzi commented on GitHub (Nov 5, 2024):

Started a corresponding report at https://github.com/ggerganov/llama.cpp/issues/10157. If anyone is faster at reproducing this without Ollama, please feel free to comment or edit there.


@jessegross commented on GitHub (Nov 6, 2024):

Confirmed that this was triggered by the llama.cpp bump, though it might be a latent issue that is only being exposed through stricter error checking. Thanks for narrowing it down, @rick-github!


@elvizlai commented on GitHub (Dec 10, 2024):

Very long content without a correct chunk split will cause this strange error.

With the same content, vLLM reports:

openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 512 tokens. However, you requested 786 tokens in the input for embedding generation. Please reduce the length of the input.", 'type': 'BadRequestError', 'param': None, 'code': 400}

But for Ollama, the server log contains:

GGML_ASSERT(i01 >= 0 && i01 < ne01)

POST predict: Post \"http://127.0.0.1:32925/completion\": EOF

@rick-github commented on GitHub (Dec 10, 2024):

It would be helpful if you could provide the server log, the model, and an example of the input you are using.


@sammyf commented on GitHub (Dec 27, 2024):

This happens for moondream, and only on CPU. ollama version is 0.5.4

sammy@raspberrypi:~ $ ollama run moondream
>>> /home/sammy/ollimca_icon.png 
Added image '/home/sammy/ollimca_icon.png'
Error: POST predict: Post "http://127.0.0.1:35537/completion": EOF

The image or its size doesn't seem to matter.
![ollimca_icon](https://github.com/user-attachments/assets/f67a788e-7c81-4efe-8e49-8195333c47fc)


@rick-github commented on GitHub (Dec 27, 2024):

It would be helpful if you could provide the server log.


@sammyf commented on GitHub (Dec 27, 2024):

Here is the log. That's on a Raspberry Pi 5 with 8 GB. Moondream used to work some time ago (I hadn't used it in several months).


Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: model name:   vikhyatk/moondream2
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: description:  image encoder for vikhyatk/moondream2
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: GGUF version: 3
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: alignment:    32
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: n_tensors:    457
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: n_kv:         19
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: ftype:        f16
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-4cc1cb3660d87ff56432ebeb7884ad35d67c48c7b9>
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   0:                       general.architecture str              = clip
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   4:                          general.file_type u32              = 1
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   7:                        clip.projector_type str              = mlp
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  18:                              clip.use_gelu bool             = true
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - type  f32:  285 tensors
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - type  f16:  172 tensors
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: CLIP using CPU backend
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: text_encoder:   0
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: vision_encoder: 1
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: llava_projector:  1
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: minicpmv_projector:  0
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: model size:     867.61 MB
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: metadata size:  0.16 MB
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: compute allocated memory: 50.10 MB
Dec 27 15:54:23 raspberrypi ollama[86846]: ggml-cpu.c:8482: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
Dec 27 15:54:24 raspberrypi ollama[86846]: [GIN] 2024/12/27 - 15:54:24 | 200 |      55.203µs |   192.168.0.100 | GET      "/api/ps"
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87149]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87150]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87151]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87152]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87153]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87155]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87154]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87159]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87160]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87161]
Dec 27 15:54:24 raspberrypi ollama[87163]: [Thread debugging using libthread_db enabled]
Dec 27 15:54:24 raspberrypi ollama[87163]: Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Dec 27 15:54:24 raspberrypi ollama[87163]: 0x00005555f90b7348 in ggml_barrier ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #0  0x00005555f90b7348 in ggml_barrier ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #1  0x00005555f90c3218 in ?? ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #2  0x00005555f90c5930 in ggml_graph_compute ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #3  0x00005555f9153fb4 in ?? ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #4  0x00005555f9148404 in ggml_backend_graph_compute ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #5  0x00005555f911f89c in clip_image_batch_encode ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #6  0x00005555f912251c in clip_image_encode ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #7  0x00005555f921d018 in llava_image_embed_make_with_clip_img ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #8  0x00005555f921da78 in llava_image_embed_make_with_bytes ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #9  0x00005555f909b690 in _cgo_eb41d09845a5_Cfunc_llava_image_embed_make_with_bytes ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #10 0x00005555f865735c in _start ()
Dec 27 15:54:24 raspberrypi ollama[87163]: Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Dec 27 15:54:24 raspberrypi ollama[87163]: [Inferior 1 (process 87148) detached]
Dec 27 15:54:24 raspberrypi ollama[86846]: [GIN] 2024/12/27 - 15:54:24 | 200 | 28.072215872s |       127.0.0.1 | POST     "/api/chat"


@rick-github commented on GitHub (Dec 28, 2024):

Looks like the same issue. You can either roll back to 0.3.13, try a different model, or [get a GPU](https://www.jeffgeerling.com/blog/2024/use-external-gpu-on-raspberry-pi-5-4k-gaming).


@sammyf commented on GitHub (Dec 29, 2024):

> Looks like the same issue. You can either roll back to 0.3.13, try a different model, or [get a GPU](https://www.jeffgeerling.com/blog/2024/use-external-gpu-on-raspberry-pi-5-4k-gaming).

If I may make a suggestion: close this and make a sticky somewhere about moondream (and possibly other models) not running on CPU-bound systems due to upstream issues.

(Also, 0.3.13 had a bug with embeddings that was resolved in 0.3.14, there are no other vision models able to run in a very-low-RAM environment, and ... a GPU? On an RPi5 ;)


@rick-github commented on GitHub (Dec 29, 2024):

The ollama team has chosen to keep this issue open for tracking.

> and ... a GPU? On an RPi5 ;)

Did you click through the link?


@bjonnh commented on GitHub (Jan 2, 2025):

I removed the assert in ggml-cpu.c and that's working now (not a long-term solution; maybe you want to change the expected version somehow?).


@Justin-12138 commented on GitHub (Jan 14, 2025):

@rick-github Could you please have a look at the error I encountered? I'm using an embedding model: bge-large:latest

![Image](https://github.com/user-attachments/assets/f47fbdef-dcbf-476c-acb5-b66805f06360)

![Image](https://github.com/user-attachments/assets/89094a10-acc2-41bf-a105-ffbc7a181b0f)

And my server's log:
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.682952212s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.912196085s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.917658805s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.91196088s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.912470179s | 10.88.128.143 | POST "/api/embed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.919251593s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.916054131s | 10.88.128.143 | POST "/api/embed"
time=2025-01-14T15:49:38.685Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=6.134886073 model=/root/.ollama/models/blobs/sha256-92b37e50807d951e27ead73c059cf9c3b14941498e37dfde57271e19e6d411df


@rick-github commented on GitHub (Jan 15, 2025):

Your problem is different; it is this one: https://github.com/ollama/ollama/issues/7288.

The problem is that the context length that ollama is using is longer than the context length that the model supports. ollama is using the default of 2048, and bge-large:latest has a context length of 512:

$ ollama show bge-large:latest
  Model
    architecture        bert       
    parameters          334.09M    
    context length      512        
    embedding length    1024       
    quantization        F16        

You can prevent these errors by setting "options":{"num_ctx":512} in the API call, or modifying the model to specify the context length:

ollama cp bge-large:latest bge-large:original
ollama rm bge-large:latest
ollama show --modelfile bge-large:original > Modelfile
echo PARAMETER num_ctx 512 >> Modelfile
ollama create -f Modelfile bge-large:latest

Note that the errors occur because you are requesting embeddings for text longer than the model supports. This means the text will be truncated and the embeddings will lose semantic content. You should adjust the chunk size of your embedding client to less than 512.
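As a minimal sketch of the per-request option (assuming the ollama Python client; the input text is just a placeholder):

```py
import ollama

# Cap this request's context length at the model's 512-token limit.
res = ollama.embeddings(
    model="bge-large:latest",
    prompt="some text to embed",  # placeholder input
    options={"num_ctx": 512},
)
print(len(res["embedding"]))  # dimensionality of the returned embedding
```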


@Justin-12138 commented on GitHub (Jan 15, 2025):

@rick-github Thanks, I tried "options":{"num_ctx":512} and it works well! 💯
But the logs always show this:

![Image](https://github.com/user-attachments/assets/ac0c0a66-aa0e-44df-bb25-fc387a09565f)


@rick-github commented on GitHub (Jan 15, 2025):

Follow up in #8431


@jobnomade commented on GitHub (Jan 24, 2025):

> This happens for moondream, and only on CPU. ollama version is 0.5.4
>
> sammy@raspberrypi:~ $ ollama run moondream
> >>> /home/sammy/ollimca_icon.png
> Added image '/home/sammy/ollimca_icon.png'
> Error: POST predict: Post "http://127.0.0.1:35537/completion": EOF
>
> The image or its size doesn't seem to matter

I have the same observation and wrapped my head around why it is crashing. I tested on my local M2 MacBook and on a server with 32 vCPUs and 64 GB of RAM; moondream2 crashed on both.

My Ollama version is 0.5.7-0-ga420a45-dirty (docker image).

As I understand it, there is no solution but to roll back to an old version of ollama (0.3.13) or to get a GPU, correct?

My ollama server logs:

ollama    | time=2025-01-24T13:52:26.390Z level=INFO source=server.go:104 msg="system memory" total="62.9 GiB" free="58.0 GiB" free_swap="8.0 GiB"
ollama    | time=2025-01-24T13:52:26.392Z level=INFO source=memory.go:356 msg="offload to cpu" projector.weights="867.6 MiB" projector.graph="0 B" layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[58.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="0 B" memory.required.kv="1.5 GiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.1 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="82.2 MiB" memory.graph.full="544.0 MiB" memory.graph.partial="540.0 MiB"
ollama    | time=2025-01-24T13:52:26.392Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-e554c6b9de016673fd2c732e0342967727e9659ca5f853a4947cc96263fa602b --ctx-size 8192 --batch-size 512 --mmproj /root/.ollama/models/blobs/sha256-4cc1cb3660d87ff56432ebeb7884ad35d67c48c7b9f6b2856f305e39c38eed8f --threads 32 --no-mmap --parallel 4 --port 39517"
ollama    | time=2025-01-24T13:52:26.392Z level=INFO source=sched.go:449 msg="loaded runners" count=2
ollama    | time=2025-01-24T13:52:26.392Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
ollama    | time=2025-01-24T13:52:26.393Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
ollama    | time=2025-01-24T13:52:26.397Z level=INFO source=runner.go:936 msg="starting go runner"
ollama    | time=2025-01-24T13:52:26.400Z level=INFO source=runner.go:937 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=32
ollama    | time=2025-01-24T13:52:26.400Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:39517"
ollama    | llama_model_loader: loaded meta data with 20 key-value pairs and 245 tensors from /root/.ollama/models/blobs/sha256-e554c6b9de016673fd2c732e0342967727e9659ca5f853a4947cc96263fa602b (version GGUF V3 (latest))
ollama    | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama    | llama_model_loader: - kv   0:                       general.architecture str              = phi2
ollama    | llama_model_loader: - kv   1:                               general.name str              = moondream2
ollama    | llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
ollama    | llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
ollama    | llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
ollama    | llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
ollama    | llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
ollama    | llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
ollama    | llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
ollama    | llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
ollama    | llama_model_loader: - kv  10:                          general.file_type u32              = 2
ollama    | llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
ollama    | llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
ollama    | llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama    | llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama    | llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["�� t", "�� a", "h e", "i n", "r e",...
ollama    | llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
ollama    | llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
ollama    | llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
ollama    | llama_model_loader: - kv  19:               general.quantization_version u32              = 2
ollama    | llama_model_loader: - type  f32:  147 tensors
ollama    | llama_model_loader: - type q4_0:   97 tensors
ollama    | llama_model_loader: - type q6_K:    1 tensors
ollama    | llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
ollama    | llm_load_vocab: special tokens cache size = 944
ollama    | llm_load_vocab: token to piece cache size = 0.3151 MB
ollama    | llm_load_print_meta: format           = GGUF V3 (latest)
ollama    | llm_load_print_meta: arch             = phi2
ollama    | llm_load_print_meta: vocab type       = BPE
ollama    | llm_load_print_meta: n_vocab          = 51200
ollama    | llm_load_print_meta: n_merges         = 50000
ollama    | llm_load_print_meta: vocab_only       = 0
ollama    | llm_load_print_meta: n_ctx_train      = 2048
ollama    | llm_load_print_meta: n_embd           = 2048
ollama    | llm_load_print_meta: n_layer          = 24
ollama    | llm_load_print_meta: n_head           = 32
ollama    | llm_load_print_meta: n_head_kv        = 32
ollama    | llm_load_print_meta: n_rot            = 32
ollama    | llm_load_print_meta: n_swa            = 0
ollama    | llm_load_print_meta: n_embd_head_k    = 64
ollama    | llm_load_print_meta: n_embd_head_v    = 64
ollama    | llm_load_print_meta: n_gqa            = 1
ollama    | llm_load_print_meta: n_embd_k_gqa     = 2048
ollama    | llm_load_print_meta: n_embd_v_gqa     = 2048
ollama    | llm_load_print_meta: f_norm_eps       = 1.0e-05
ollama    | llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
ollama    | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
ollama    | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama    | llm_load_print_meta: f_logit_scale    = 0.0e+00
ollama    | llm_load_print_meta: n_ff             = 8192
ollama    | llm_load_print_meta: n_expert         = 0
ollama    | llm_load_print_meta: n_expert_used    = 0
ollama    | llm_load_print_meta: causal attn      = 1
ollama    | llm_load_print_meta: pooling type     = 0
ollama    | llm_load_print_meta: rope type        = 2
ollama    | llm_load_print_meta: rope scaling     = linear
ollama    | llm_load_print_meta: freq_base_train  = 10000.0
ollama    | llm_load_print_meta: freq_scale_train = 1
ollama    | llm_load_print_meta: n_ctx_orig_yarn  = 2048
ollama    | llm_load_print_meta: rope_finetuned   = unknown
ollama    | llm_load_print_meta: ssm_d_conv       = 0
ollama    | llm_load_print_meta: ssm_d_inner      = 0
ollama    | llm_load_print_meta: ssm_d_state      = 0
ollama    | llm_load_print_meta: ssm_dt_rank      = 0
ollama    | llm_load_print_meta: ssm_dt_b_c_rms   = 0
ollama    | llm_load_print_meta: model type       = 1B
ollama    | llm_load_print_meta: model ftype      = Q4_0
ollama    | llm_load_print_meta: model params     = 1.42 B
ollama    | llm_load_print_meta: model size       = 788.55 MiB (4.66 BPW)
ollama    | llm_load_print_meta: general.name     = moondream2
ollama    | llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: LF token         = 128 '��'
ollama    | llm_load_print_meta: EOG token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: max token length = 256
ollama    | llm_load_tensors:          CPU model buffer size =   140.55 MiB
ollama    | llm_load_tensors:  CPU_AARCH64 model buffer size =   648.00 MiB
ollama    | time=2025-01-24T13:52:26.644Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
ollama    | llama_new_context_with_model: n_seq_max     = 4
ollama    | llama_new_context_with_model: n_ctx         = 8192
ollama    | llama_new_context_with_model: n_ctx_per_seq = 2048
ollama    | llama_new_context_with_model: n_batch       = 2048
ollama    | llama_new_context_with_model: n_ubatch      = 512
ollama    | llama_new_context_with_model: flash_attn    = 0
ollama    | llama_new_context_with_model: freq_base     = 10000.0
ollama    | llama_new_context_with_model: freq_scale    = 1
ollama    | llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
ollama    | llama_kv_cache_init:        CPU KV buffer size =  1536.00 MiB
ollama    | llama_new_context_with_model: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
ollama    | llama_new_context_with_model:        CPU  output buffer size =     0.81 MiB
ollama    | llama_new_context_with_model:        CPU compute buffer size =   556.01 MiB
ollama    | llama_new_context_with_model: graph nodes  = 921
ollama    | llama_new_context_with_model: graph splits = 1
ollama    | clip_model_load: model name:   vikhyatk/moondream2
ollama    | clip_model_load: description:  image encoder for vikhyatk/moondream2
ollama    | clip_model_load: GGUF version: 3
ollama    | clip_model_load: alignment:    32
ollama    | clip_model_load: n_tensors:    457
ollama    | clip_model_load: n_kv:         19
ollama    | clip_model_load: ftype:        f16
ollama    |
ollama    | clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from /root/.ollama/models/blobs/sha256-4cc1cb3660d87ff56432ebeb7884ad35d67c48c7b9f6b2856f305e39c38eed8f
ollama    | clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama    | clip_model_load: - kv   0:                       general.architecture str              = clip
ollama    | clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
ollama    | clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
ollama    | clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
ollama    | clip_model_load: - kv   4:                          general.file_type u32              = 1
ollama    | clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
ollama    | clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
ollama    | clip_model_load: - kv   7:                        clip.projector_type str              = mlp
ollama    | clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
ollama    | clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
ollama    | clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
ollama    | clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
ollama    | clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
ollama    | clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
ollama    | clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
ollama    | clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
ollama    | clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
ollama    | clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
ollama    | clip_model_load: - kv  18:                              clip.use_gelu bool             = true
ollama    | clip_model_load: - type  f32:  285 tensors
ollama    | clip_model_load: - type  f16:  172 tensors
ollama    | clip_model_load: CLIP using CPU backend
ollama    | key clip.use_silu not found in file
ollama    | clip_model_load: text_encoder:   0
ollama    | clip_model_load: vision_encoder: 1
ollama    | clip_model_load: llava_projector:  1
ollama    | clip_model_load: minicpmv_projector:  0
ollama    | clip_model_load: model size:     867.61 MB
ollama    | clip_model_load: metadata size:  0.16 MB
ollama    | clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
ollama    | key clip.vision.image_grid_pinpoints not found in file
ollama    | key clip.vision.mm_patch_merge_type not found in file
ollama    | key clip.vision.image_crop_resolution not found in file
ollama    | clip_model_load: compute allocated memory: 50.10 MB
ollama    | time=2025-01-24T13:52:28.904Z level=INFO source=server.go:594 msg="llama runner started in 2.51 seconds"
ollama    | ggml-cpu.c:8482: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed

What I find a bit strange is that when I went looking for help / documentation, I found the following comment from the author at https://github.com/vikhyat/moondream/blob/main/README.md:

> ⚠️ Note: The Python client currently only supports CPU inference. CUDA (GPU) and MPS (Apple Silicon) optimization is coming soon. For GPU support, use the Hugging Face transformers implementation below.

It is a bit misleading though when CPU is not supported.

Thanks for the awesome work and for Ollama!

@rick-github commented on GitHub (Jan 24, 2025):

> As I understand, there is no solution but to rollback to an old version of ollama 0.3.13 or to get a GPU, correct?

Correct. The llama.cpp issue has been closed as stale, as nobody has the time to dig into it. You can follow bjonnh's example and build a custom version with the assert removed, although this may have other issues.
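For anyone attempting that route, the change is roughly of this shape (a sketch only, against the ggml-cpu.c check quoted in the log above; the exact context varies by version, and clamping merely hides the symptom):

```c
#include <stdint.h>

// Illustration only, not the actual ggml source. In ggml-cpu.c (the log
// above points at line 8482) the failing check is:
//     GGML_ASSERT(i01 >= 0 && i01 < ne01);
// A custom build can clamp the row index instead of aborting:
static int64_t clamp_row(int64_t i01, int64_t ne01) {
    if (i01 < 0)     return 0;
    if (i01 >= ne01) return ne01 - 1;
    return i01;
}
// The runner then survives, but any clamped lookup reads the wrong row,
// so output quality may suffer -- hence "this may have other issues".
```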

@jobnomade commented on GitHub (Jan 24, 2025):

I have switched to the llava-phi3 model (https://ollama.com/library/llava-phi3:3.8b-mini-q4_0) for now. It works on CPU.

I am not a C++ dev and do not have deep experience with llama.cpp. I prompted Cursor with the issue and it produced the walkthrough below; maybe it is of help. If not, ignore it.

1. The Specific Error Point:

```c
GGML_ASSERT(i01 >= 0 && i01 < ne01);
```

This assertion is failing in ggml-cpu.c, which suggests that the index i01 is either negative or exceeds the expected tensor dimension ne01. It occurs during tensor operations, specifically during the CLIP vision encoder's attention mechanism.

2. Analyzing the Model Architecture:

From the logs:

```
clip_model_load: vision_encoder: 1
clip.vision.image_size u32              = 378
clip.vision.patch_size u32              = 14
clip.vision.embedding_length u32        = 1152
clip.vision.attention.head_count u32    = 16
```

The model uses a CLIP vision encoder with:

  • Input image size: 378x378
  • Patch size: 14x14
  • Embedding dimension: 1152
  • Number of attention heads: 16

3. Potential Root Causes:

a. Tensor Shape Mismatch:

  • The assertion is likely failing because of a dimension mismatch during attention computation
  • The attention mechanism expects certain tensor shapes based on these parameters (sanity-checked in the sketch after this walkthrough):
    • Number of patches: (378/14)² = 27x27 = 729 patches
    • Attention head dimension: 1152/16 = 72 dimensions per head

b. Quantization Issues:

```
llm_load_print_meta: model ftype      = Q4_0
```
  • The model is using Q4_0 quantization
  • This aggressive quantization might be causing precision issues that affect tensor dimensions

c. Memory Layout:

```
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
```
  • The context size configuration might be causing memory alignment issues during CPU computation

4. Why It's CPU-Specific:
  • GPU implementations often have more flexible memory handling
  • CPU implementations need stricter bounds checking
  • The assertion might be too strict for the CPU implementation's memory layout
5. Technical Analysis:

```c
// Pseudo-code of what might be happening: the row index comes from data,
// not from the loop counter, so it can fall outside the tensor.
for (int i = 0; i < n_rows; i++) {
    int i01 = row_index[i];               // index read from an index tensor
    // ne01 is the number of rows actually present in the source tensor
    GGML_ASSERT(i01 >= 0 && i01 < ne01);  // this fails
}
```

The likely scenarios are:

  1. The attention computation is trying to access positions beyond the expected sequence length
  2. The tensor dimensions are not properly aligned after quantization
  3. The memory layout assumptions in the CPU implementation don't match the model's requirements

Recommended Solutions:

1. Proper Fix Would Involve:

```c
// Add dimension checking before computation
if (ne01 != expected_sequence_length) {
    // Realign dimensions or raise a proper error
}

// Or add padding/truncation handling
i01 = MIN(i01, ne01 - 1);
```

2. Model-side Fix:
  • Ensure tensor dimensions are properly aligned
  • Add proper padding handling
  • Validate sequence lengths before computation
3. Runtime Fix:
  • Add proper dimension validation
  • Implement dynamic padding
  • Add proper error handling for dimension mismatches
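For what it's worth, the patch and head arithmetic in point 2 of the walkthrough is easy to sanity-check with a few lines of standalone C (illustrative only, using the values from the clip_model_load log above):

```c
#include <stdio.h>

int main(void) {
    // Values taken from the clip_model_load log above.
    const int image_size = 378;   // clip.vision.image_size
    const int patch_size = 14;    // clip.vision.patch_size
    const int n_embd     = 1152;  // clip.vision.embedding_length
    const int n_head     = 16;    // clip.vision.attention.head_count

    const int side      = image_size / patch_size;  // 27 patches per side
    const int n_patches = side * side;              // 27 * 27 = 729 patches
    const int head_dim  = n_embd / n_head;          // 1152 / 16 = 72 dims per head

    printf("patches per side: %d\n", side);
    printf("total patches:    %d\n", n_patches);
    printf("dims per head:    %d\n", head_dim);
    return 0;
}
```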

@alex-jw-brooks commented on GitHub (Feb 20, 2025):

I ran into this issue while adding support for granite vision to llama.cpp / ollama and have opened a fix in llama.cpp: https://github.com/ggml-org/llama.cpp/pull/11982.

The issue is that the patches vector used to grab rows from the visual features right before the projector holds values [1, ..., num_features], where 0 is skipped to handle the CLS feature. For visual encoders like siglip, which have no CLS, this causes the following situation:

  • Siglip produces 729 visual feature rows here (and no CLS row)
  • patches is initialized with values [1, ..., 729] because of the hardcoded +1 for CLS
  • But since there is no CLS, index 729 is out of range (valid rows are 0 to 728)

I've verified the fix by testing both llama.cpp and ollama with NanoLlava and granite vision 🙂
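To make the off-by-one concrete, here is a minimal standalone sketch of the failure mode (names and structure are illustrative, not the actual llama.cpp code):

```c
#include <assert.h>

// 378x378 image with 14x14 patches -> (378/14)^2 = 729 feature rows,
// and siglip emits no extra CLS row.
#define N_PATCHES 729

int main(void) {
    // The projector gathers rows of the visual-feature tensor through an
    // index vector. A hardcoded "+1" (meant to skip a CLS row) makes the
    // indices run [1 .. 729].
    int patches[N_PATCHES];
    for (int i = 0; i < N_PATCHES; i++) {
        patches[i] = i + 1;  // +1 for a CLS token that siglip does not have
    }

    // A ggml get_rows-style lookup checks each requested index against the
    // number of rows actually present (ne01). With no CLS row, ne01 == 729,
    // so the final index (729) trips exactly the failing assertion.
    const int ne01 = N_PATCHES;
    for (int i = 0; i < N_PATCHES; i++) {
        const int i01 = patches[i];
        assert(i01 >= 0 && i01 < ne01);  // fires when i01 == 729
    }
    return 0;
}
```

With the fix described above, encoders without a CLS token presumably index from 0, so every lookup stays in range.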

@jessegross commented on GitHub (Feb 28, 2025):

@alex-jw-brooks's patch is now in main, so we can finally close this issue. Thanks!
