[GH-ISSUE #2813] Seems unable to use the "Dynamic High Resolution" feature of llava1.6 (aka llava-next) #63744

Open
opened 2026-05-03 14:50:42 -05:00 by GiteaMirror · 3 comments

Originally created by @jeff31415 on GitHub (Feb 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2813

https://llava-vl.github.io/blog/2024-01-30-llava-next/
Thanks for supporting llava1.6, but Ollama currently still seems unable to use the "Dynamic High Resolution" feature, which is important for llava1.6 to achieve state-of-the-art OCR accuracy and a low hallucination rate, as mentioned in another issue about degraded OCR accuracy (#2562).

```
time=2024-02-29T00:24:54.520+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
time=2024-02-29T00:24:54.521+08:00 level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
encode_image_with_clip: image embedding created: 576 tokens
```

Note: for llava1.6 to operate with more patches, it should be using more than 1.1k tokens per image, not 576. It would also be a good idea to let the request JSON carry an option for choosing the input resolution, to balance quality against inference cost.
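For illustration, here is a minimal sketch of what such a knob could look like on the wire. The endpoint, the `images` field, and the `options` object are Ollama's existing `/api/generate` API; the `image_resolution` option is hypothetical and only stands in for the parameter this issue proposes.

```python
# Sketch of an /api/generate call with a HYPOTHETICAL per-request resolution knob.
# Only "image_resolution" is invented here; everything else is the current API.
import base64
import json
import urllib.request

with open("document.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "llava:34b-v1.6-q8_0",
    "prompt": "Transcribe all text in this image.",
    "images": [image_b64],
    "stream": False,
    "options": {
        # Hypothetical: choose how aggressively the vision tower tiles the
        # image, trading OCR quality against prompt-processing cost.
        "image_resolution": "high",
    },
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```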


@RandomGitUser321 commented on GitHub (Mar 7, 2024):

EDIT: And for reference, I'm using the LLaVa v1.6 Mistral 7B model (q5km I think).

I found that if I manually downloaded a mmproj-model-f16.gguf off HF (I used the one from cjpais), moved it to the C:\Users\USERNAME\.ollama\models\blobs folder, and renamed it to the long sha256-1234567... name (backing up the old one by renaming it to .bak), Ollama now shows 2880 tokens used to encode an image, meaning it's doing the tiling correctly. Just check the file sizes on the sha256 files: the mmproj will be ~600MB, not the multi-GB one.
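As an aside, a tiny script (not from the thread) can make that size check easier by listing every blob with its size. It assumes the default store location; on Windows that is under `%USERPROFILE%\.ollama`.

```python
# List Ollama's blob store by size so the ~600 MB mmproj blob is easy to
# spot next to the multi-GB language-model weights. Assumes the default
# store location (C:\Users\USERNAME\.ollama on Windows).
from pathlib import Path

blob_dir = Path.home() / ".ollama" / "models" / "blobs"

for blob in sorted(blob_dir.iterdir(), key=lambda p: p.stat().st_size):
    print(f"{blob.stat().st_size / 1e9:8.2f} GB  {blob.name}")
```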

Here's what the server.log looks like now:

```
encode_image_with_clip: 5 segments encoded in   462.51 ms
encode_image_with_clip: image embedding created: 2880 tokens

encode_image_with_clip: image encoded in   483.20 ms by CLIP (    0.17 ms per image patch)
{"function":"print_timings","level":"INFO","line":264,"msg":"prompt eval time     =    2494.64 ms /     1 tokens ( 2494.64 ms per token,     0.40 tokens per second)","n_prompt_tokens_processed":1,"n_tokens_second":0.4008587998928906,"slot_id":0,"t_prompt_processing":2494.644,"t_token":2494.644,"task_id":0,"tid":"10108","timestamp":1709774423}
{"function":"print_timings","level":"INFO","line":278,"msg":"generation eval time =    1518.83 ms /    85 runs   (   17.87 ms per token,    55.96 tokens per second)","n_decoded":85,"n_tokens_second":55.9643145194476,"slot_id":0,"t_token":17.868529411764705,"t_token_generation":1518.825,"task_id":0,"tid":"10108","timestamp":1709774423}
{"function":"print_timings","level":"INFO","line":287,"msg":"          total time =    4013.47 ms","slot_id":0,"t_prompt_processing":2494.644,"t_token_generation":1518.825,"t_total":4013.469,"task_id":0,"tid":"10108","timestamp":1709774423}
```

Here is with the mmproj model that ollama wants to download (shows 576 tokens for an image):

```
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    377
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16
```

Here's what it looks like when I used a mmproj model that I know works (shows 2880 tokens for an image):

```
clip_model_load: model name:   vit-large336-custom
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    378
clip_model_load: n_kv:         25
clip_model_load: ftype:        f16
```

My guess is that the Ollama registry is still serving the old v1.5 mmproj model and doesn't have the correct v1.6 one?
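A quick way to tell the two projector files apart before swapping blobs is to read the GGUF metadata directly. The sketch below assumes the `gguf` Python package published from the llama.cpp repo (`pip install gguf`); as far as I can tell, `clip.vision.image_grid_pinpoints` is the metadata key llama.cpp's LLaVA 1.6 tiling reads, so a projector without it will only ever produce 576 tokens.

```python
# Print the same counts clip_model_load logs above, straight from the GGUF
# file, using the `gguf` package from the llama.cpp project.
from gguf import GGUFReader

reader = GGUFReader("mmproj-model-f16.gguf")  # path to the projector blob

print("n_kv:     ", len(reader.fields))   # 19 in the v1.5 log, 25 in the v1.6 log
print("n_tensors:", len(reader.tensors))  # 377 vs 378 above

# Assumption: this key carries the tiling grid for Dynamic High Resolution;
# if it is missing, the projector behaves like v1.5 (576 tokens per image).
print("has grid pinpoints:", "clip.vision.image_grid_pinpoints" in reader.fields)
```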


@blurgyy commented on GitHub (Apr 17, 2024):

For anybody observing empty output while using LLaVA 1.6 34b after replacing the mmproj model (`sha256-83720bd8438ccdc910deba5efbdc3340820b29258d94a7a60d1addc9a1b5f095` in my ~/.ollama/models/blobs) with https://hf-mirror.com/cjpais/llava-v1.6-34B-gguf/blob/main/mmproj-model-f16.gguf: the reason is that the default context window size is 2048, while the correct image embedding has 2880 tokens, which is larger than the context window. The sign is a log line like this:

{"function":"update_slots","level":"ERR","line":1876,"msg":"failed processing images","slot_id":0,"task_id":8,"tid":"139989500841984","timestamp":1713321617}

The solution is to create a new model with a larger context window (with `PARAMETER num_ctx`) from the LLaVA 1.6 34b model, e.g. with this Modelfile:

```modelfile
FROM llava:34b-v1.6-q8_0
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 8192
```
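With that saved as a `Modelfile`, the model is built and used the usual way, e.g. `ollama create llava-hires -f Modelfile` followed by `ollama run llava-hires` (the `llava-hires` name here is just an example).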

@blurgyy commented on GitHub (Apr 17, 2024):

I've put together a LLaVA 1.6 34b model with Q8_0 quantization according to my previous comment, at https://ollama.com/highsunz/llava:34b-v1.6-q8_0-hires-ctx8k; its projection model projects an input image to 2880 tokens.
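(Assuming the registry path works the usual way, it can be pulled and used directly with `ollama run highsunz/llava:34b-v1.6-q8_0-hires-ctx8k`.)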
