[GH-ISSUE #11452] Infinitely repeating responses with Qwen2.5VL-7B #7562

Closed
opened 2026-04-12 19:39:37 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @alexmi256 on GitHub (Jul 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11452

What is the issue?

I'm trying to describe images using Qwen2.5VL-7B, but I noticed that Ollama hangs on certain images when run without OLLAMA_FLASH_ATTENTION, while other images of the same or larger size are unaffected, which is pretty odd.

The image size is 1080x1080 (I believe this might be resized to 1024x1024 internally) and the image is pretty simple, containing just some text.

![Image](https://github.com/user-attachments/assets/57195705-3e85-4ff8-9ba5-0aca0226a7ff)

This "hang" happens while Ollama reports this line over and over again:

time=2025-07-16T19:56:59.529-04:00 level=DEBUG source=cache.go:272 msg="context limit hit - shifting" id=0 limit=4096 input=4096 keep=4 discard=2046
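As a sanity check on the numbers in that line: discard=2046 is exactly what you'd get if the shift drops half of the cache beyond the kept prefix (my reading of the behavior, not something I've confirmed in cache.go):

```python
# Sketch: reproduce the discard count from the log line above,
# assuming the shift discards half of the context beyond `keep`.
limit = 4096  # context limit from the log (num_ctx)
keep = 4      # tokens kept at the start of the cache
discard = (limit - keep) // 2
print(discard)  # 2046, matching discard=2046 in the log
```

So every shift throws away half the window, the model refills it with the same repeated phrase, and the cycle never terminates.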

I looked at the streaming response, and it is just the same text repeating over and over without stopping.

Arnaqueur c'est pas le même prix afficher en plus moi il m'as vendu une cuisinière à 650 meme le recu je l'est toujours j'allais acheter la la cuisinière à 850 $ quand je lui est dis j'ai toujours refusé il m'as forcer de la garder une fois j'ai refusé il m'as envoyé le reste grave je vais vous renvoyer le le recu il m'as rien envoyé et j'ai découvert que cette dernière est endommagée une fois j'ai demandé de la faire retournée il m'as dit que que c'est de la faite retournée il m'as dit que que c'est de la faite retournée il m'as dit que que c'est de la faite retournée [... the clause "il m'as dit que que c'est de la faite retournée" repeats indefinitely ...]

I tried using the API with

    "options": {
       "num_predict": 512
    },

And while this cuts the response off at 512 tokens, it's still the same repeating text.
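Since I'm consuming the stream anyway, my current client-side workaround is to detect the loop and stop reading; a rough sketch (the window size and repeat threshold here are arbitrary values I picked, not anything Ollama provides):

```python
def is_looping(text: str, ngram_words: int = 8, min_repeats: int = 3) -> bool:
    """Crude runaway-loop signal: True if the last ngram_words-word
    phrase already occurs at least min_repeats times in the text."""
    words = text.split()
    if len(words) < ngram_words * min_repeats:
        return False  # too short to call it a loop yet
    tail = " ".join(words[-ngram_words:])
    return text.count(tail) >= min_repeats

# Usage idea: accumulate the streamed chunks into one string and
# close the HTTP stream as soon as is_looping(accumulated) is True.
```

This catches the "il m'as dit que que c'est de la faite retournée" case above, but it's obviously a band-aid rather than a fix.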

My request to ollama is in https://gist.github.com/alexmi256/343a35e4453a9500e82572ad86b04ab6

I've tried GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, OLLAMA_GPU_OVERHEAD=1536870912, and OLLAMA_CONTEXT_LENGTH=8192, but all give the same result; only OLLAMA_FLASH_ATTENTION or num_predict alleviates the issue.

  1. Any idea why this happens?
  2. Can someone else reproduce this issue with my prompt and Qwen2.5VL-7B?
  3. Any other options/configs to prevent this from happening and get a legible response?
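For question 3, the best I've come up with so far is combining num_predict with Ollama's repetition-penalty sampling options; a sketch of the request body I'd try against /api/generate (the option names are standard Ollama options, but the values are guesses on my part, and I haven't verified they break this particular loop):

```python
import json

# Hypothetical request body for POST /api/generate.
payload = {
    "model": "qwen2.5vl:7b",
    "prompt": "Describe this image.",
    "images": ["<base64-encoded image>"],  # placeholder, not real data
    "options": {
        "num_predict": 512,     # hard cap on generated tokens, as tried above
        "repeat_penalty": 1.2,  # penalize recently generated tokens
        "repeat_last_n": 256,   # window the penalty looks back over
        "temperature": 0.7,
    },
}
print(json.dumps(payload["options"]))
```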

Relevant log output

$ OLLAMA_DEBUG=1 OLLAMA_HOST=192.168.2.44:11434 OLLAMA_MODELS='/usr/share/ollama/.ollama/models' OLLAMA_ORIGINS='http://192.168.2.*' GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 OLLAMA_NUM_PARALLEL=1 ollama serve
time=2025-07-16T19:39:38.119-04:00 level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://192.168.2.44:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://192.168.2.* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-07-16T19:39:38.120-04:00 level=INFO source=images.go:476 msg="total blobs: 30"                                                                                            
time=2025-07-16T19:39:38.120-04:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"                                                                              
time=2025-07-16T19:39:38.120-04:00 level=INFO source=routes.go:1288 msg="Listening on 192.168.2.44:11434 (version 0.9.6)"                                                           
time=2025-07-16T19:39:38.121-04:00 level=DEBUG source=sched.go:108 msg="starting llm scheduler"                                                                                     
time=2025-07-16T19:39:38.121-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"                                                                                   
time=2025-07-16T19:39:38.122-04:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"                                                              
time=2025-07-16T19:39:38.122-04:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*                                                                   
time=2025-07-16T19:39:38.122-04:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /home/alex/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-07-16T19:39:38.125-04:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03]                              
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03                                                                                                                         
dlsym: cuInit - 0x7e8f8a974640                                                                                                                                                      
dlsym: cuDriverGetVersion - 0x7e8f8a974700                                                                                                                                          
dlsym: cuDeviceGetCount - 0x7e8f8a974880                                                                                                                                            
dlsym: cuDeviceGet - 0x7e8f8a9747c0                                                                                                                                                 
dlsym: cuDeviceGetAttribute - 0x7e8f8a974dc0                                                                                                                                        
dlsym: cuDeviceGetUuid - 0x7e8f8a974a00                                                                                                                                             
dlsym: cuDeviceGetName - 0x7e8f8a974940                                                                                                                                             
dlsym: cuCtxCreate_v3 - 0x7e8f8a975900                                                                                                                                              
dlsym: cuMemGetInfo_v2 - 0x7e8f8a9785a0                                                                                                                                             
dlsym: cuCtxDestroy - 0x7e8f8a9da4a0                                                                                                                                                
calling cuInit                                                                                                                                                                      
calling cuDriverGetVersion                                                                                                                                                          
raw version 0x2f3a                                                                                                                                                                  
CUDA driver version: 12.9                                                                                                                                                           
calling cuDeviceGetCount                                                                                                                                                            
device count 1                                                                                                                                                                      
time=2025-07-16T19:39:38.198-04:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03                                 
[GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372] CUDA totalMem 24118mb                                                                                                                    
[GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372] CUDA freeMem 23654mb                                                                                                                     
[GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372] Compute Capability 8.6                                                                                                                   
time=2025-07-16T19:39:38.352-04:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"                                                          
releasing cuda driver library
time=2025-07-16T19:39:38.352-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.1 GiB"
time=2025-07-16T19:39:43.377-04:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="62.7 GiB" before.free="60.2 GiB" before.free_swap="8.0 GiB" now.total="62.7 GiB" now.free="60.1 GiB" now.free_swap="8.0 GiB"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03
dlsym: cuInit - 0x7e8f8a974640
dlsym: cuDriverGetVersion - 0x7e8f8a974700
dlsym: cuDeviceGetCount - 0x7e8f8a974880
dlsym: cuDeviceGet - 0x7e8f8a9747c0
dlsym: cuDeviceGetAttribute - 0x7e8f8a974dc0
dlsym: cuDeviceGetUuid - 0x7e8f8a974a00
dlsym: cuDeviceGetName - 0x7e8f8a974940
dlsym: cuCtxCreate_v3 - 0x7e8f8a975900
dlsym: cuMemGetInfo_v2 - 0x7e8f8a9785a0
dlsym: cuCtxDestroy - 0x7e8f8a9da4a0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 1
time=2025-07-16T19:39:43.532-04:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.1 GiB" now.total="23.6 GiB" now.free="23.1 GiB" now.used="464.2 MiB"
releasing cuda driver library
time=2025-07-16T19:39:43.532-04:00 level=DEBUG source=sched.go:185 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-07-16T19:39:43.545-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-16T19:39:43.573-04:00 level=DEBUG source=sched.go:228 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025
time=2025-07-16T19:39:43.573-04:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[23.1 GiB]"
time=2025-07-16T19:39:43.573-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.image_size default=0
time=2025-07-16T19:39:43.574-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-07-16T19:39:43.574-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.key_length default=128
time=2025-07-16T19:39:43.574-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.value_length default=128
time=2025-07-16T19:39:43.575-04:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025 gpu=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 parallel=1 available=24803016704 required="8.0 GiB"
time=2025-07-16T19:39:43.575-04:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="62.7 GiB" before.free="60.1 GiB" before.free_swap="8.0 GiB" now.total="62.7 GiB" now.free="60.1 GiB" now.free_swap="8.0 GiB"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03
dlsym: cuInit - 0x7e8f8a974640
dlsym: cuDriverGetVersion - 0x7e8f8a974700
dlsym: cuDeviceGetCount - 0x7e8f8a974880
dlsym: cuDeviceGet - 0x7e8f8a9747c0
dlsym: cuDeviceGetAttribute - 0x7e8f8a974dc0
dlsym: cuDeviceGetUuid - 0x7e8f8a974a00
dlsym: cuDeviceGetName - 0x7e8f8a974940
dlsym: cuCtxCreate_v3 - 0x7e8f8a975900
dlsym: cuMemGetInfo_v2 - 0x7e8f8a9785a0
dlsym: cuCtxDestroy - 0x7e8f8a9da4a0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 1
time=2025-07-16T19:39:43.721-04:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.1 GiB" now.total="23.6 GiB" now.free="23.1 GiB" now.used="464.2 MiB"
releasing cuda driver library
time=2025-07-16T19:39:43.721-04:00 level=INFO source=server.go:135 msg="system memory" total="62.7 GiB" free="60.1 GiB" free_swap="8.0 GiB"
time=2025-07-16T19:39:43.721-04:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[23.1 GiB]"
time=2025-07-16T19:39:43.721-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.image_size default=0
time=2025-07-16T19:39:43.721-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-07-16T19:39:43.722-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.key_length default=128
time=2025-07-16T19:39:43.722-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.value_length default=128
time=2025-07-16T19:39:43.722-04:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[23.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="8.0 GiB" memory.required.partial="8.0 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[8.0 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="261.3 MiB" memory.graph.partial="261.3 MiB" projector.weights="1.2 GiB" projector.graph="1.6 GiB"
time=2025-07-16T19:39:43.722-04:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.freq_scale default=1
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}"
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025 --ctx-size 4096 --batch-size 512 --n-gpu-layers 29 --threads 12 --parallel 1 --port 40451"
time=2025-07-16T19:39:43.745-04:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_NUM_PARALLEL=1 OLLAMA_MODELS=/usr/share/ollama/.ollama/models OLLAMA_HOST=192.168.2.44:11434 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 OLLAMA_ORIGINS=http://192.168.2.* OLLAMA_DEBUG=1 PATH=/home/alex/.nvm/versions/node/v22.17.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/alex/.local/share/JetBrains/Toolbox/scripts:/home/alex/.local/share/JetBrains/Toolbox/scripts OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama CUDA_VISIBLE_DEVICES=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372
time=2025-07-16T19:39:43.745-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-16T19:39:43.754-04:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-07-16T19:39:43.754-04:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:40451"
time=2025-07-16T19:39:43.776-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.name default=""
time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.description default=""
time=2025-07-16T19:39:43.777-04:00 level=INFO source=ggml.go:92 msg="" architecture=qwen25vl file_type=Q4_K_M name="" description="" num_tensors=858 num_key_values=36
time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
time=2025-07-16T19:39:43.838-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:359 msg="offloading 28 repeating layers to GPU"
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:365 msg="offloading output layer to GPU"
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:375 msg="offloaded 29/29 layers to GPU"
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:377 msg="model weights" buffer=CPU size="292.4 MiB"
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:377 msg="model weights" buffer=CUDA0 size="5.3 GiB"
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.freq_scale default=1
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}"
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-07-16T19:39:43.996-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-07-16T19:39:44.188-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-07-16T19:39:44.189-04:00 level=DEBUG source=ggml.go:648 msg="compute graph" nodes=1748 splits=1
time=2025-07-16T19:39:44.189-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.7 GiB"
time=2025-07-16T19:39:44.189-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-07-16T19:39:44.222-04:00 level=DEBUG source=ggml.go:648 msg="compute graph" nodes=1073 splits=2
time=2025-07-16T19:39:44.222-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.7 GiB"
time=2025-07-16T19:39:44.222-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="16.8 MiB"
time=2025-07-16T19:39:44.228-04:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=306561024A allocated.CPU.Graph=17575936A allocated.CUDA0.UUID=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 allocated.CUDA0.Weights="[149112832A 149112832A 149112832A 131608576A 131608576A 149112832A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 149112832A 148639744A 148639744A 149112832A 1732615168A]" allocated.CUDA0.Cache="[8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 0U]" allocated.CUDA0.Graph=1781219584A
time=2025-07-16T19:39:44.247-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.02"
time=2025-07-16T19:39:44.497-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.26"
time=2025-07-16T19:39:44.748-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.47"
time=2025-07-16T19:39:44.998-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.67"
time=2025-07-16T19:39:45.249-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.86"
time=2025-07-16T19:39:45.499-04:00 level=INFO source=server.go:637 msg="llama runner started in 1.75 seconds"
time=2025-07-16T19:39:45.499-04:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen2.5vl:7b runner.inference=cuda runner.devices=1 runner.size="8.0 GiB" runner.vram="8.0 GiB" runner.parallel=1 runner.pid=157407 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025 runner.num_ctx=4096
time=2025-07-16T19:39:45.500-04:00 level=DEBUG source=server.go:736 msg="completion request" images=1 prompt=714 format=""
time=2025-07-16T19:39:45.502-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-07-16T19:39:45.700-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-07-16T19:39:45.700-04:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=1369 used=0 remaining=1369
time=2025-07-16T19:40:10.455-04:00 level=DEBUG source=cache.go:272 msg="context limit hit - shifting" id=0 limit=4096 input=4096 keep=4 discard=2046

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.9.6

source=server.go:291 msg="compatible gpu libraries" compatible=[] time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32 time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L} \\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128 time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.freq_scale default=1 time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}" time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520 time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/m odels/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025 --ctx-size 4096 --batch-size 512 --n-gpu-layers 29 --threads 12 --parallel 1 --port 40451" time=2025-07-16T19:39:43.745-04:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_NUM_PARALLEL=1 
OLLAMA_MODELS=/usr/share/ollama/.ollama/models OLLAMA_HOST=192.168.2.44:114 34 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 OLLAMA_ORIGINS=http://192.168.2.* OLLAMA_DEBUG=1 PATH=/home/alex/.nvm/versions/node/v22.17.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/ bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/alex/.local/share/JetBrains/Toolbox/scripts:/home/alex/.local/share/JetBrains/Toolbox/scripts OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama CUDA_VISIBLE_DEVICES=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 time=2025-07-16T19:39:43.745-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=1 time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding" time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding" time=2025-07-16T19:39:43.754-04:00 level=INFO source=runner.go:925 msg="starting ollama engine" time=2025-07-16T19:39:43.754-04:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:40451" time=2025-07-16T19:39:43.776-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32 time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.name default="" time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.description default="" time=2025-07-16T19:39:43.777-04:00 level=INFO source=ggml.go:92 msg="" architecture=qwen25vl file_type=Q4_K_M name="" description="" num_tensors=858 num_key_values=36 time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 
3090, compute capability 8.6, VMM: yes load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so time=2025-07-16T19:39:43.838-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFIL E=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:359 msg="offloading 28 repeating layers to GPU" time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:365 msg="offloading output layer to GPU" time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:375 msg="offloaded 29/29 layers to GPU" time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:377 msg="model weights" buffer=CPU size="292.4 MiB" time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:377 msg="model weights" buffer=CUDA0 size="5.3 GiB" time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L} \\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128 time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type 
not found" key=qwen25vl.rope.freq_scale default=1 time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}" time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520 time=2025-07-16T19:39:43.996-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model" time=2025-07-16T19:39:44.188-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0] time=2025-07-16T19:39:44.189-04:00 level=DEBUG source=ggml.go:648 msg="compute graph" nodes=1748 splits=1 time=2025-07-16T19:39:44.189-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.7 GiB" time=2025-07-16T19:39:44.189-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="0 B" time=2025-07-16T19:39:44.222-04:00 level=DEBUG source=ggml.go:648 msg="compute graph" nodes=1073 splits=2 time=2025-07-16T19:39:44.222-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.7 GiB" time=2025-07-16T19:39:44.222-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="16.8 MiB" time=2025-07-16T19:39:44.228-04:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=306561024A allocated.CPU.Graph=17575936A allocated.CUDA0.UUID=GPU-9e2cae97-cd7 a-4f47-7eae-db2e2be81372 allocated.CUDA0.Weights="[149112832A 149112832A 149112832A 131608576A 131608576A 149112832A 131608576A 131135488A 148639744A 131608576A 131135488A 14863974 4A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 149112832A 148639744A 148639744A 149112832A 1 732615168A]" allocated.CUDA0.Cache="[8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 
8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 0U]" allocated.CUDA0.Graph=1781219584A time=2025-07-16T19:39:44.247-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.02" time=2025-07-16T19:39:44.497-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.26" time=2025-07-16T19:39:44.748-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.47" time=2025-07-16T19:39:44.998-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.67" time=2025-07-16T19:39:45.249-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.86" time=2025-07-16T19:39:45.499-04:00 level=INFO source=server.go:637 msg="llama runner started in 1.75 seconds" time=2025-07-16T19:39:45.499-04:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen2.5vl:7b runner.inference=cuda runner.device s=1 runner.size="8.0 GiB" runner.vram="8.0 GiB" runner.parallel=1 runner.pid=157407 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f099 4a83f8363df2c1e71c17605a025 runner.num_ctx=4096 time=2025-07-16T19:39:45.500-04:00 level=DEBUG source=server.go:736 msg="completion request" images=1 prompt=714 format="" time=2025-07-16T19:39:45.502-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0] time=2025-07-16T19:39:45.700-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0] time=2025-07-16T19:39:45.700-04:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=1369 used=0 remaining=1369 time=2025-07-16T19:40:10.455-04:00 level=DEBUG source=cache.go:272 msg="context limit hit - shifting" id=0 limit=4096 input=4096 keep=4 discard=2046 ``` ### OS Linux ### GPU Nvidia ### CPU _No response_ ### Ollama version 0.9.6
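The repeated "context limit hit - shifting" line is consistent with generation that never terminates: each time the 4096-token window fills, the runner evicts part of it and keeps generating. A rough sketch of the arithmetic implied by the log values (the "keep the first few tokens, drop half of the rest" rule is an assumption read off the numbers in the log, not taken from Ollama's source):

```python
# Values from the repeated log line:
#   "context limit hit - shifting" limit=4096 input=4096 keep=4 discard=2046
limit, keep = 4096, 4

# Assumed shift rule: keep the first `keep` tokens and discard half of the
# remainder to free room, so generation can continue indefinitely if the
# model never emits an end-of-sequence token.
discard = (limit - keep) // 2
print(discard)
```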
GiteaMirror added the bug label 2026-04-12 19:39:37 -05:00

@cloudtuotuo commented on GitHub (Jul 18, 2025):

Qwen2.5VL has always had this kind of issue: repeating output for some pictures.

<!-- gh-comment-id:3086409190 -->

@alexmi256 commented on GitHub (Jul 18, 2025):

Thanks for confirming this is not some odd one-off error.

I looked into it [more](https://github.com/QwenLM/Qwen2.5-VL/issues/241) and it looks like repetition is a somewhat common LLM problem, not only for this model.

I came across the [DRY](https://github.com/ggml-org/llama.cpp/pull/9702) [sampler](https://github.com/oobabooga/text-generation-webui/pull/5677) and tried using it in the API, but wasn't able to get any change in output using

```json
"options": {
  "dry_multiplier": 0.3,
  "dry_base": 1.7,
  "dry_allowed_length": 2
}
```

Probably user error in this case.
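For reference, a minimal sketch of a full request body carrying such options. The model name, prompt, and `repeat_penalty` value here are illustrative assumptions, and whether a given Ollama build actually recognizes the DRY keys is not guaranteed; unrecognized options may simply have no effect, which would also explain an unchanged output.

```python
import json

# Hypothetical body for a POST to Ollama's /api/generate endpoint.
# All field values are illustrative, not taken from the report above.
payload = {
    "model": "qwen2.5vl:7b",
    "prompt": "Describe this image.",
    "stream": False,
    "options": {
        "dry_multiplier": 0.3,   # DRY options may be ignored by builds
        "dry_base": 1.7,         # that don't implement the DRY sampler
        "dry_allowed_length": 2,
        "repeat_penalty": 1.15,  # classic repetition penalty as a fallback
    },
}

body = json.dumps(payload)
print(body)
```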

<!-- gh-comment-id:3089774887 --> @alexmi256 commented on GitHub (Jul 18, 2025): Thanks for confirming this is not some odd one off error. I looked into it [more](https://github.com/QwenLM/Qwen2.5-VL/issues/241) it looks like repetitions are a somewhat common LLM problem not only for that model. I came across [DRY](https://github.com/ggml-org/llama.cpp/pull/9702) [sampler](https://github.com/oobabooga/text-generation-webui/pull/5677) and tried using it in the API but wasn't able to get any change in output using ```json "options": { "dry_multiplier": 0.3 "dry_base": 1.7 "dry_allowed_length": 2 } ``` Probably user error in this case.
Reference: github-starred/ollama#7562