[GH-ISSUE #7441] Error: unknown error was encountered while running the model GGML_ASSERT(i01 >= 0 && i01 < ne01) failed #30490

Closed
opened 2026-04-22 10:08:45 -05:00 by GiteaMirror · 25 comments

Originally created by @kalcao on GitHub (Oct 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7441

What is the issue?

[Nanollava](https://ollama.com/qnguyen3/nanollava) returns a `GGML_ASSERT(i01 >= 0 && i01 < ne01) failed` error on chat with an image

Output:

ubuntu@ubuntu:~/workspace$ ollama run qnguyen3/nanollava "tell me what do you see in this picture? ./sample.jpg"
Added image './sample.jpg'
Error: an unknown error was encountered while running the model GGML_ASSERT(i01 >= 0 && i01 < ne01) failed

OS

Linux

GPU

Other

CPU

AMD

Ollama version

0.3.14

GiteaMirror added the bug label 2026-04-22 10:08:45 -05:00

@rick-github commented on GitHub (Oct 31, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging. It would also be helpful to have a copy of the image that caused the failure.

$ ollama run qnguyen3/nanollava "tell me what do you see in this picture? ./puppy.jpg"
Added image './puppy.jpg'
This image showcases a domestic scene of a small white puppy with black eyes, standing on a concrete
ledge. The puppy appears  to be crouching down, looking over the edge of the ledge. The environment
seems to have a stone floor or steps nearby, and there's a cat resting near the edge of a post. A red collar
around the puppy's neck is also visible.
$ ollama -v
ollama version is 0.3.14

@kalcao commented on GitHub (Nov 1, 2024):

I have uploaded the logs to https://hastebin.skyra.pw/edinabafos.prolog
The Python code I used for it is:

import ollama

res = ollama.chat(
	model="qnguyen3/nanollava:latest",
	messages=[
		{
			'role': 'user',
			'content': 'Describe this image:',
			'images': ['./image.png']
		}
	]
)

print(res['message']['content'])

![image](https://github.com/user-attachments/assets/73832b86-d277-47fc-b47e-4b4a7da4817a)
This is the picture I tried. I also tried other pictures but still get the same error.


@rick-github commented on GitHub (Nov 1, 2024):

This affects CPU-based runners (cpu, cpu_avx, cpu_avx2) from 0.3.14 onwards. Earlier versions work fine, as do CUDA-based runners in all versions through 0.4.0-rc6. ROCm and Metal are untested.

$ curl localhost:11434/api/version
{"version":"0.3.13"}
$ (echo '{"model":"qnguyen3/nanollava","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq
{
  "model": "qnguyen3/nanollava",
  "created_at": "2024-11-01T22:29:39.812672017Z",
  "message": {
    "role": "assistant",
    "content": "An outdoor scene of a river with blue water."
  },
  "done_reason": "stop",
  "done": true,
  "total_duration": 4992363967,
  "load_duration": 1942686202,
  "prompt_eval_duration": 2707665000,
  "eval_count": 11,
  "eval_duration": 272003000
}
$ curl localhost:11434/api/version
{"version":"0.3.14"}
$ (echo '{"model":"qnguyen3/nanollava","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq
{
  "error": "an unknown error was encountered while running the model GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
}
$ curl localhost:11434/api/version
{"version":"0.4.0-rc6"}
$ (echo '{"model":"qnguyen3/nanollava","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq
{
  "error": "POST predict: Post \"http://127.0.0.1:32925/completion\": EOF"
}

Other llava-based models appear unaffected:

$ curl localhost:11434/api/version
{"version":"0.3.14"}
$ (echo '{"model":"llava:7b-v1.5-q4_K_M","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq .message.content
" The image captures a serene scene of a large lake, surrounded by trees and hills. A beautiful mountain stream flows into the lake from a higher elevation. The tranquil water reflects the landscape, creating an idyllic setting for relaxation or leisure activities.\n\nA small plant is visible near the water's edge, adding to the natural beauty of the scene. In the background, there are several cars parked at various distances from each other, potentially belonging to visitors enjoying the view and tranquility that this picturesque landscape offers."
$ (echo '{"model":"llava-llama3:8b-v1.1-q4_0","options":{"num_gpu":0},"messages":[{"role":"user","content":"Describe this image:","images":["' ; base64 -w0 image.png ; echo '"]}],"stream":false}') | curl -s localhost:11434/api/chat -d @- | jq .message.content
"The image captures a serene scene of a lake at sunset. The vantage point is from the shore, looking out towards the calm expanse of water that stretches out to a distant shore adorned with trees and hills. The sky above is painted in hues of blue, gradually transitioning into a warm orange as it meets the horizon. The sun is partially obscured by the hillside in the distance, casting long shadows across the lake's surface.\n\nIn the foreground on the right side of the image, there are small plants peeking out from the ground, adding a touch of green to the scene. On the left side of the image, there's a small boat moored near the shore, perhaps indicating that this is a place where people come to enjoy the tranquility of the lake.\n\nThe overall atmosphere conveyed by the image is one of peace and quietude, as if time itself has slowed down in this particular corner of the world. The precise location of each object - from the boat on the left to the plants on the right, and the hills beyond the water - adds depth to the image, creating a sense of distance and scale.\n\nThere's no text visible in the image, reinforcing the impression that this is a place untouched by modernity or technology. The relative positions of the objects suggest a well-balanced composition, with each element contributing to the overall harmony of the scene. The image doesn't just show what can be seen; it tells a story of a quiet moment frozen in time, a snapshot of nature's beauty undisturbed."

The assert happens at this line: https://github.com/ollama/ollama/blob/8a9bb0d000ae8201445ef1a590d7136df0a16f8b/llama/ggml.c#L13425
I believe this is a result of https://github.com/ollama/ollama/commit/f2890a4494f9fb3722ee7a4c506252362d1eab65, which added support for the granite models and, as a side effect, bumped to a new version of llama.cpp.

This probably needs to be logged with [llama.cpp](https://github.com/ggerganov/llama.cpp/issues). In the meantime, you can either roll back to 0.3.13, try a different model, or get a GPU.
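For reference, a minimal reproduction sketch using the Python client (assumptions: a local `./image.png` exists, and `num_gpu: 0` is enough to force the CPU runner, as in the curl examples above):

```py
import ollama

# Force the CPU runner with num_gpu=0; on 0.3.14+ this should trip the assert.
res = ollama.chat(
    model="qnguyen3/nanollava",
    messages=[{
        "role": "user",
        "content": "Describe this image:",
        "images": ["./image.png"],  # placeholder path
    }],
    options={"num_gpu": 0},
)
print(res["message"]["content"])
```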


@kalcao commented on GitHub (Nov 2, 2024):

I installed 0.3.13 and it worked fine. Thank you very much for the help!


@jessegross commented on GitHub (Nov 4, 2024):

We should keep this open to track the issue, even if the fix is ultimately in llama.cpp.


@ccreutzi commented on GitHub (Nov 5, 2024):

Started a corresponding report at https://github.com/ggerganov/llama.cpp/issues/10157. If anyone is faster at reproducing this without Ollama, please feel free to comment or edit there.


@jessegross commented on GitHub (Nov 6, 2024):

Confirmed that this was triggered by the llama.cpp bump, though it might be a latent issue that is only being exposed through stricter error checking. Thanks for narrowing it down, @rick-github!


@elvizlai commented on GitHub (Dec 10, 2024):

Very long content without a correct chunk split will cause this strange error.

With the same content, vLLM reports:

openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 512 tokens. However, you requested 786 tokens in the input for embedding generation. Please reduce the length of the input.", 'type': 'BadRequestError', 'param': None, 'code': 400}

But for Ollama, the server log contains:

GGML_ASSERT(i01 >= 0 && i01 < ne01)

POST predict: Post \"http://127.0.0.1:32925/completion\": EOF

@rick-github commented on GitHub (Dec 10, 2024):

It would be helpful if you could provide the server log, the model, and an example of the input you are using.


@sammyf commented on GitHub (Dec 27, 2024):

This happens for moondream, and only on CPU. ollama version is 0.5.4

sammy@raspberrypi:~ $ ollama run moondream
>>> /home/sammy/ollimca_icon.png 
Added image '/home/sammy/ollimca_icon.png'
Error: POST predict: Post "http://127.0.0.1:35537/completion": EOF

The image or its size doesn't seem to matter.
![ollimca_icon](https://github.com/user-attachments/assets/f67a788e-7c81-4efe-8e49-8195333c47fc)


@rick-github commented on GitHub (Dec 27, 2024):

It would be helpful if you could provide the server log.


@sammyf commented on GitHub (Dec 27, 2024):

Here is the log. That's on a Raspberry Pi 5 with 8 GB. Moondream used to work some time ago (I hadn't used it in several months).


Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: model name:   vikhyatk/moondream2
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: description:  image encoder for vikhyatk/moondream2
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: GGUF version: 3
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: alignment:    32
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: n_tensors:    457
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: n_kv:         19
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: ftype:        f16
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-4cc1cb3660d87ff56432ebeb7884ad35d67c48c7b9>
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   0:                       general.architecture str              = clip
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   4:                          general.file_type u32              = 1
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   7:                        clip.projector_type str              = mlp
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - kv  18:                              clip.use_gelu bool             = true
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - type  f32:  285 tensors
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: - type  f16:  172 tensors
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: CLIP using CPU backend
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: text_encoder:   0
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: vision_encoder: 1
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: llava_projector:  1
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: minicpmv_projector:  0
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: model size:     867.61 MB
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: metadata size:  0.16 MB
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
Dec 27 15:54:23 raspberrypi ollama[87148]: clip_model_load: compute allocated memory: 50.10 MB
Dec 27 15:54:23 raspberrypi ollama[86846]: ggml-cpu.c:8482: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
Dec 27 15:54:24 raspberrypi ollama[86846]: [GIN] 2024/12/27 - 15:54:24 | 200 |      55.203µs |   192.168.0.100 | GET      "/api/ps"
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87149]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87150]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87151]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87152]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87153]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87155]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87154]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87159]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87160]
Dec 27 15:54:24 raspberrypi ollama[87163]: [New LWP 87161]
Dec 27 15:54:24 raspberrypi ollama[87163]: [Thread debugging using libthread_db enabled]
Dec 27 15:54:24 raspberrypi ollama[87163]: Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Dec 27 15:54:24 raspberrypi ollama[87163]: 0x00005555f90b7348 in ggml_barrier ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #0  0x00005555f90b7348 in ggml_barrier ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #1  0x00005555f90c3218 in ?? ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #2  0x00005555f90c5930 in ggml_graph_compute ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #3  0x00005555f9153fb4 in ?? ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #4  0x00005555f9148404 in ggml_backend_graph_compute ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #5  0x00005555f911f89c in clip_image_batch_encode ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #6  0x00005555f912251c in clip_image_encode ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #7  0x00005555f921d018 in llava_image_embed_make_with_clip_img ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #8  0x00005555f921da78 in llava_image_embed_make_with_bytes ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #9  0x00005555f909b690 in _cgo_eb41d09845a5_Cfunc_llava_image_embed_make_with_bytes ()
Dec 27 15:54:24 raspberrypi ollama[87163]: #10 0x00005555f865735c in _start ()
Dec 27 15:54:24 raspberrypi ollama[87163]: Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Dec 27 15:54:24 raspberrypi ollama[87163]: [Inferior 1 (process 87148) detached]
Dec 27 15:54:24 raspberrypi ollama[86846]: [GIN] 2024/12/27 - 15:54:24 | 200 | 28.072215872s |       127.0.0.1 | POST     "/api/chat"


@rick-github commented on GitHub (Dec 28, 2024):

Looks like the same issue. You can either roll back to 0.3.13, try a different model, or [get a GPU](https://www.jeffgeerling.com/blog/2024/use-external-gpu-on-raspberry-pi-5-4k-gaming).


@sammyf commented on GitHub (Dec 29, 2024):

> Looks like the same issue. You can either roll back to 0.3.13, try a different model, or [get a GPU](https://www.jeffgeerling.com/blog/2024/use-external-gpu-on-raspberry-pi-5-4k-gaming).

If I may make a suggestion: close this and make a sticky somewhere about moondream (and possibly other models) not running on CPU-bound systems due to upstream issues.

(Also, 0.3.13 had a bug with embeddings that was resolved in 0.3.14, there are no other vision models able to run in a very-low-RAM environment, and ... a GPU? On an RPi5 ;)


@rick-github commented on GitHub (Dec 29, 2024):

The ollama team has chosen to keep this issue open for tracking.

> and ... a GPU? On an RPi5 ;)

Did you click through the link?


@bjonnh commented on GitHub (Jan 2, 2025):

I removed the assert in ggml-cpu.c and that's working now (not a long-term solution; maybe you want to change the expected version somehow?).


@Justin-12138 commented on GitHub (Jan 14, 2025):

@rick-github Could you please have a look at the error I encountered? I'm using an embedding model: bge-large:latest

![Image](https://github.com/user-attachments/assets/f47fbdef-dcbf-476c-acb5-b66805f06360)

![Image](https://github.com/user-attachments/assets/89094a10-acc2-41bf-a105-ffbc7a181b0f)

And my server's log:
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.682952212s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.912196085s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.917658805s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.91196088s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.912470179s | 10.88.128.143 | POST "/api/embed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-01-14T15:49:31.500Z level=ERROR source=routes.go:473 msg="embedding generation failed" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed\nggml-cpu.c:8539: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.919251593s | 10.88.128.143 | POST "/api/embed"
[GIN] 2025/01/14 - 15:49:31 | 500 | 21.916054131s | 10.88.128.143 | POST "/api/embed"
time=2025-01-14T15:49:38.685Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=6.134886073 model=/root/.ollama/models/blobs/sha256-92b37e50807d951e27ead73c059cf9c3b14941498e37dfde57271e19e6d411df


@rick-github commented on GitHub (Jan 15, 2025):

Your problem is different; it is this one: https://github.com/ollama/ollama/issues/7288.

The problem is that the context length that ollama is using is longer than the context length that the model supports. ollama is using the default of 2048, and bge-large:latest has a context length of 512:

$ ollama show bge-large:latest
  Model
    architecture        bert       
    parameters          334.09M    
    context length      512        
    embedding length    1024       
    quantization        F16        

You can prevent these errors by setting "options":{"num_ctx":512} in the API call, or modifying the model to specify the context length:

ollama cp bge-large:latest bge-large:original
ollama rm bge-large:latest
ollama show --modelfile bge-large:original > Modelfile
echo PARAMETER num_ctx 512 >> Modelfile
ollama create -f Modelfile bge-large:latest

Note that the errors occur because you are requesting embeddings for text longer than the model supports. This means the text will be truncated and the embeddings will lose semantic content. You should adjust the chunk size of your embedding client to less than 512.
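As a minimal sketch of the per-request option (assuming the ollama Python client; the input text is just a placeholder):

```py
import ollama

# Cap this request's context length at the model's 512-token limit.
res = ollama.embeddings(
    model="bge-large:latest",
    prompt="some text to embed",  # placeholder input
    options={"num_ctx": 512},
)
print(len(res["embedding"]))  # dimensionality of the returned embedding
```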


@Justin-12138 commented on GitHub (Jan 15, 2025):

@rick-github Thanks, I tried "options":{"num_ctx":512} and it works well! 💯
But the logs always show this:

![Image](https://github.com/user-attachments/assets/ac0c0a66-aa0e-44df-bb25-fc387a09565f)


@rick-github commented on GitHub (Jan 15, 2025):

Follow up in #8431


@jobnomade commented on GitHub (Jan 24, 2025):

> This happens for moondream, and only on CPU. ollama version is 0.5.4
>
> sammy@raspberrypi:~ $ ollama run moondream
> >>> /home/sammy/ollimca_icon.png
> Added image '/home/sammy/ollimca_icon.png'
> Error: POST predict: Post "http://127.0.0.1:35537/completion": EOF
>
> The image or its size doesn't seem to matter

I have the same observation and wrapped my head around why it is crashing. I tested on my local M2 MacBook and on a server with 32 vCPUs and 64 GB of RAM; moondream2 crashed on both.

My Ollama version is 0.5.7-0-ga420a45-dirty (docker image).

As I understand it, there is no solution but to roll back to an old version of ollama (0.3.13) or to get a GPU, correct?

My ollama server logs:

ollama    | time=2025-01-24T13:52:26.390Z level=INFO source=server.go:104 msg="system memory" total="62.9 GiB" free="58.0 GiB" free_swap="8.0 GiB"
ollama    | time=2025-01-24T13:52:26.392Z level=INFO source=memory.go:356 msg="offload to cpu" projector.weights="867.6 MiB" projector.graph="0 B" layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[58.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="0 B" memory.required.kv="1.5 GiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.1 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="82.2 MiB" memory.graph.full="544.0 MiB" memory.graph.partial="540.0 MiB"
ollama    | time=2025-01-24T13:52:26.392Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-e554c6b9de016673fd2c732e0342967727e9659ca5f853a4947cc96263fa602b --ctx-size 8192 --batch-size 512 --mmproj /root/.ollama/models/blobs/sha256-4cc1cb3660d87ff56432ebeb7884ad35d67c48c7b9f6b2856f305e39c38eed8f --threads 32 --no-mmap --parallel 4 --port 39517"
ollama    | time=2025-01-24T13:52:26.392Z level=INFO source=sched.go:449 msg="loaded runners" count=2
ollama    | time=2025-01-24T13:52:26.392Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
ollama    | time=2025-01-24T13:52:26.393Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
ollama    | time=2025-01-24T13:52:26.397Z level=INFO source=runner.go:936 msg="starting go runner"
ollama    | time=2025-01-24T13:52:26.400Z level=INFO source=runner.go:937 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=32
ollama    | time=2025-01-24T13:52:26.400Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:39517"
ollama    | llama_model_loader: loaded meta data with 20 key-value pairs and 245 tensors from /root/.ollama/models/blobs/sha256-e554c6b9de016673fd2c732e0342967727e9659ca5f853a4947cc96263fa602b (version GGUF V3 (latest))
ollama    | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama    | llama_model_loader: - kv   0:                       general.architecture str              = phi2
ollama    | llama_model_loader: - kv   1:                               general.name str              = moondream2
ollama    | llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
ollama    | llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
ollama    | llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
ollama    | llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
ollama    | llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
ollama    | llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
ollama    | llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
ollama    | llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
ollama    | llama_model_loader: - kv  10:                          general.file_type u32              = 2
ollama    | llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
ollama    | llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
ollama    | llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama    | llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama    | llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["�� t", "�� a", "h e", "i n", "r e",...
ollama    | llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
ollama    | llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
ollama    | llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
ollama    | llama_model_loader: - kv  19:               general.quantization_version u32              = 2
ollama    | llama_model_loader: - type  f32:  147 tensors
ollama    | llama_model_loader: - type q4_0:   97 tensors
ollama    | llama_model_loader: - type q6_K:    1 tensors
ollama    | llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
ollama    | llm_load_vocab: special tokens cache size = 944
ollama    | llm_load_vocab: token to piece cache size = 0.3151 MB
ollama    | llm_load_print_meta: format           = GGUF V3 (latest)
ollama    | llm_load_print_meta: arch             = phi2
ollama    | llm_load_print_meta: vocab type       = BPE
ollama    | llm_load_print_meta: n_vocab          = 51200
ollama    | llm_load_print_meta: n_merges         = 50000
ollama    | llm_load_print_meta: vocab_only       = 0
ollama    | llm_load_print_meta: n_ctx_train      = 2048
ollama    | llm_load_print_meta: n_embd           = 2048
ollama    | llm_load_print_meta: n_layer          = 24
ollama    | llm_load_print_meta: n_head           = 32
ollama    | llm_load_print_meta: n_head_kv        = 32
ollama    | llm_load_print_meta: n_rot            = 32
ollama    | llm_load_print_meta: n_swa            = 0
ollama    | llm_load_print_meta: n_embd_head_k    = 64
ollama    | llm_load_print_meta: n_embd_head_v    = 64
ollama    | llm_load_print_meta: n_gqa            = 1
ollama    | llm_load_print_meta: n_embd_k_gqa     = 2048
ollama    | llm_load_print_meta: n_embd_v_gqa     = 2048
ollama    | llm_load_print_meta: f_norm_eps       = 1.0e-05
ollama    | llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
ollama    | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
ollama    | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama    | llm_load_print_meta: f_logit_scale    = 0.0e+00
ollama    | llm_load_print_meta: n_ff             = 8192
ollama    | llm_load_print_meta: n_expert         = 0
ollama    | llm_load_print_meta: n_expert_used    = 0
ollama    | llm_load_print_meta: causal attn      = 1
ollama    | llm_load_print_meta: pooling type     = 0
ollama    | llm_load_print_meta: rope type        = 2
ollama    | llm_load_print_meta: rope scaling     = linear
ollama    | llm_load_print_meta: freq_base_train  = 10000.0
ollama    | llm_load_print_meta: freq_scale_train = 1
ollama    | llm_load_print_meta: n_ctx_orig_yarn  = 2048
ollama    | llm_load_print_meta: rope_finetuned   = unknown
ollama    | llm_load_print_meta: ssm_d_conv       = 0
ollama    | llm_load_print_meta: ssm_d_inner      = 0
ollama    | llm_load_print_meta: ssm_d_state      = 0
ollama    | llm_load_print_meta: ssm_dt_rank      = 0
ollama    | llm_load_print_meta: ssm_dt_b_c_rms   = 0
ollama    | llm_load_print_meta: model type       = 1B
ollama    | llm_load_print_meta: model ftype      = Q4_0
ollama    | llm_load_print_meta: model params     = 1.42 B
ollama    | llm_load_print_meta: model size       = 788.55 MiB (4.66 BPW)
ollama    | llm_load_print_meta: general.name     = moondream2
ollama    | llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: LF token         = 128 '��'
ollama    | llm_load_print_meta: EOG token        = 50256 '<|endoftext|>'
ollama    | llm_load_print_meta: max token length = 256
ollama    | llm_load_tensors:          CPU model buffer size =   140.55 MiB
ollama    | llm_load_tensors:  CPU_AARCH64 model buffer size =   648.00 MiB
ollama    | time=2025-01-24T13:52:26.644Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
ollama    | llama_new_context_with_model: n_seq_max     = 4
ollama    | llama_new_context_with_model: n_ctx         = 8192
ollama    | llama_new_context_with_model: n_ctx_per_seq = 2048
ollama    | llama_new_context_with_model: n_batch       = 2048
ollama    | llama_new_context_with_model: n_ubatch      = 512
ollama    | llama_new_context_with_model: flash_attn    = 0
ollama    | llama_new_context_with_model: freq_base     = 10000.0
ollama    | llama_new_context_with_model: freq_scale    = 1
ollama    | llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
ollama    | llama_kv_cache_init:        CPU KV buffer size =  1536.00 MiB
ollama    | llama_new_context_with_model: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
ollama    | llama_new_context_with_model:        CPU  output buffer size =     0.81 MiB
ollama    | llama_new_context_with_model:        CPU compute buffer size =   556.01 MiB
ollama    | llama_new_context_with_model: graph nodes  = 921
ollama    | llama_new_context_with_model: graph splits = 1
ollama    | clip_model_load: model name:   vikhyatk/moondream2
ollama    | clip_model_load: description:  image encoder for vikhyatk/moondream2
ollama    | clip_model_load: GGUF version: 3
ollama    | clip_model_load: alignment:    32
ollama    | clip_model_load: n_tensors:    457
ollama    | clip_model_load: n_kv:         19
ollama    | clip_model_load: ftype:        f16
ollama    |
ollama    | clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from /root/.ollama/models/blobs/sha256-4cc1cb3660d87ff56432ebeb7884ad35d67c48c7b9f6b2856f305e39c38eed8f
ollama    | clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama    | clip_model_load: - kv   0:                       general.architecture str              = clip
ollama    | clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
ollama    | clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
ollama    | clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
ollama    | clip_model_load: - kv   4:                          general.file_type u32              = 1
ollama    | clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
ollama    | clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
ollama    | clip_model_load: - kv   7:                        clip.projector_type str              = mlp
ollama    | clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
ollama    | clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
ollama    | clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
ollama    | clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
ollama    | clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
ollama    | clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
ollama    | clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
ollama    | clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
ollama    | clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
ollama    | clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
ollama    | clip_model_load: - kv  18:                              clip.use_gelu bool             = true
ollama    | clip_model_load: - type  f32:  285 tensors
ollama    | clip_model_load: - type  f16:  172 tensors
ollama    | clip_model_load: CLIP using CPU backend
ollama    | key clip.use_silu not found in file
ollama    | clip_model_load: text_encoder:   0
ollama    | clip_model_load: vision_encoder: 1
ollama    | clip_model_load: llava_projector:  1
ollama    | clip_model_load: minicpmv_projector:  0
ollama    | clip_model_load: model size:     867.61 MB
ollama    | clip_model_load: metadata size:  0.16 MB
ollama    | clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
ollama    | key clip.vision.image_grid_pinpoints not found in file
ollama    | key clip.vision.mm_patch_merge_type not found in file
ollama    | key clip.vision.image_crop_resolution not found in file
ollama    | clip_model_load: compute allocated memory: 50.10 MB
ollama    | time=2025-01-24T13:52:28.904Z level=INFO source=server.go:594 msg="llama runner started in 2.51 seconds"
ollama    | ggml-cpu.c:8482: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed

What I find a bit strange is that when I went looking for help / documentation, I found the following comment from the author at https://github.com/vikhyat/moondream/blob/main/README.md:

> ⚠️ Note: The Python client currently only supports CPU inference. CUDA (GPU) and MPS (Apple Silicon) optimization is coming soon. For GPU support, use the Hugging Face transformers implementation below.

It is a bit misleading though when CPU is not supported.

Thanks for the awesome work and for Ollama!

@rick-github commented on GitHub (Jan 24, 2025):

> As I understand, there is no solution but to rollback to an old version of ollama 0.3.13 or to get a GPU, correct?

Correct. The llama.cpp issue has been closed as stale, as nobody has the time to dig into it. You can follow bjonnh's example and build a custom version with the assert removed, although this may have other issues.
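For anyone attempting that route, the change is roughly of this shape (a sketch only, against the ggml-cpu.c check quoted in the log above; the exact context varies by version, and clamping merely hides the symptom):

```c
#include <stdint.h>

// Illustration only, not the actual ggml source. In ggml-cpu.c (the log
// above points at line 8482) the failing check is:
//     GGML_ASSERT(i01 >= 0 && i01 < ne01);
// A custom build can clamp the row index instead of aborting:
static int64_t clamp_row(int64_t i01, int64_t ne01) {
    if (i01 < 0)     return 0;
    if (i01 >= ne01) return ne01 - 1;
    return i01;
}
// The runner then survives, but any clamped lookup reads the wrong row,
// so output quality may suffer -- hence "this may have other issues".
```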

@jobnomade commented on GitHub (Jan 24, 2025):

I have switched to the llava-phi3 model (https://ollama.com/library/llava-phi3:3.8b-mini-q4_0) for now. It works on CPU.

I am not a C++ dev and do not have deep experience with llama.cpp. I prompted Cursor with the issue and it produced the walkthrough below; maybe it is of help. If not, ignore it.

1. The Specific Error Point:

```c
GGML_ASSERT(i01 >= 0 && i01 < ne01);
```

This assertion is failing in ggml-cpu.c, which suggests that the index i01 is either negative or exceeds the expected tensor dimension ne01. It occurs during tensor operations, specifically during the CLIP vision encoder's attention mechanism.

2. Analyzing the Model Architecture:

From the logs:

```
clip_model_load: vision_encoder: 1
clip.vision.image_size u32              = 378
clip.vision.patch_size u32              = 14
clip.vision.embedding_length u32        = 1152
clip.vision.attention.head_count u32    = 16
```

The model uses a CLIP vision encoder with:

  • Input image size: 378x378
  • Patch size: 14x14
  • Embedding dimension: 1152
  • Number of attention heads: 16

3. Potential Root Causes:

a. Tensor Shape Mismatch:

  • The assertion is likely failing because of a dimension mismatch during attention computation
  • The attention mechanism expects certain tensor shapes based on these parameters (sanity-checked in the sketch after this walkthrough):
    • Number of patches: (378/14)² = 27x27 = 729 patches
    • Attention head dimension: 1152/16 = 72 dimensions per head

b. Quantization Issues:

```
llm_load_print_meta: model ftype      = Q4_0
```
  • The model is using Q4_0 quantization
  • This aggressive quantization might be causing precision issues that affect tensor dimensions

c. Memory Layout:

```
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
```
  • The context size configuration might be causing memory alignment issues during CPU computation

4. Why It's CPU-Specific:
  • GPU implementations often have more flexible memory handling
  • CPU implementations need stricter bounds checking
  • The assertion might be too strict for the CPU implementation's memory layout
5. Technical Analysis:

```c
// Pseudo-code of what might be happening: the row index comes from data,
// not from the loop counter, so it can fall outside the tensor.
for (int i = 0; i < n_rows; i++) {
    int i01 = row_index[i];               // index read from an index tensor
    // ne01 is the number of rows actually present in the source tensor
    GGML_ASSERT(i01 >= 0 && i01 < ne01);  // this fails
}
```

The likely scenarios are:

  1. The attention computation is trying to access positions beyond the expected sequence length
  2. The tensor dimensions are not properly aligned after quantization
  3. The memory layout assumptions in the CPU implementation don't match the model's requirements

Recommended Solutions:

1. Proper Fix Would Involve:

```c
// Add dimension checking before computation
if (ne01 != expected_sequence_length) {
    // Realign dimensions or raise a proper error
}

// Or add padding/truncation handling
i01 = MIN(i01, ne01 - 1);
```

2. Model-side Fix:
  • Ensure tensor dimensions are properly aligned
  • Add proper padding handling
  • Validate sequence lengths before computation
3. Runtime Fix:
  • Add proper dimension validation
  • Implement dynamic padding
  • Add proper error handling for dimension mismatches
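For what it's worth, the patch and head arithmetic in point 2 of the walkthrough is easy to sanity-check with a few lines of standalone C (illustrative only, using the values from the clip_model_load log above):

```c
#include <stdio.h>

int main(void) {
    // Values taken from the clip_model_load log above.
    const int image_size = 378;   // clip.vision.image_size
    const int patch_size = 14;    // clip.vision.patch_size
    const int n_embd     = 1152;  // clip.vision.embedding_length
    const int n_head     = 16;    // clip.vision.attention.head_count

    const int side      = image_size / patch_size;  // 27 patches per side
    const int n_patches = side * side;              // 27 * 27 = 729 patches
    const int head_dim  = n_embd / n_head;          // 1152 / 16 = 72 dims per head

    printf("patches per side: %d\n", side);
    printf("total patches:    %d\n", n_patches);
    printf("dims per head:    %d\n", head_dim);
    return 0;
}
```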

@alex-jw-brooks commented on GitHub (Feb 20, 2025):

I ran into this issue while adding support for granite vision to llama.cpp / ollama and have opened a fix in llama.cpp: https://github.com/ggml-org/llama.cpp/pull/11982.

The issue is that the patches vector used to grab rows from the visual features right before the projector holds values [1, ..., num_features], where 0 is skipped to handle the CLS feature. For visual encoders like siglip, which have no CLS, this causes the following situation:

  • Siglip produces 729 visual feature rows here (and no CLS row)
  • patches is initialized with values [1, ..., 729] because of the hardcoded +1 for CLS
  • But since there is no CLS, index 729 is out of range (valid rows are 0 to 728)

I've verified the fix by testing both llama.cpp and ollama with NanoLlava and granite vision 🙂
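To make the off-by-one concrete, here is a minimal standalone sketch of the failure mode (names and structure are illustrative, not the actual llama.cpp code):

```c
#include <assert.h>

// 378x378 image with 14x14 patches -> (378/14)^2 = 729 feature rows,
// and siglip emits no extra CLS row.
#define N_PATCHES 729

int main(void) {
    // The projector gathers rows of the visual-feature tensor through an
    // index vector. A hardcoded "+1" (meant to skip a CLS row) makes the
    // indices run [1 .. 729].
    int patches[N_PATCHES];
    for (int i = 0; i < N_PATCHES; i++) {
        patches[i] = i + 1;  // +1 for a CLS token that siglip does not have
    }

    // A ggml get_rows-style lookup checks each requested index against the
    // number of rows actually present (ne01). With no CLS row, ne01 == 729,
    // so the final index (729) trips exactly the failing assertion.
    const int ne01 = N_PATCHES;
    for (int i = 0; i < N_PATCHES; i++) {
        const int i01 = patches[i];
        assert(i01 >= 0 && i01 < ne01);  // fires when i01 == 729
    }
    return 0;
}
```

With the fix described above, encoders without a CLS token presumably index from 0, so every lookup stays in range.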

@jessegross commented on GitHub (Feb 28, 2025):

@alex-jw-brooks's patch is now in main, so we can finally close this issue. Thanks!
