[GH-ISSUE #11452] Infinitely repeating responses with Qwen2.5VL-7B #7562

Closed
opened 2026-04-12 19:39:37 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @alexmi256 on GitHub (Jul 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11452

What is the issue?

I'm trying to describe images using Qwen2.5VL-7B, but I noticed that Ollama hangs on certain images when run without OLLAMA_FLASH_ATTENTION, while other images of the same or larger size are unaffected, which is pretty odd.

The image size is 1080x1080 (I believe this might be resized to 1024x1024 internally) and the image is pretty simple, containing just some text.

![Image](https://github.com/user-attachments/assets/57195705-3e85-4ff8-9ba5-0aca0226a7ff)

This "hang" happens while Ollama reports this line over and over again:

time=2025-07-16T19:56:59.529-04:00 level=DEBUG source=cache.go:272 msg="context limit hit - shifting" id=0 limit=4096 input=4096 keep=4 discard=2046
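As a sanity check on the numbers in that line: discard=2046 is exactly what you'd get if the shift drops half of the cache beyond the kept prefix (my reading of the behavior, not something I've confirmed in cache.go):

```python
# Sketch: reproduce the discard count from the log line above,
# assuming the shift discards half of the context beyond `keep`.
limit = 4096  # context limit from the log (num_ctx)
keep = 4      # tokens kept at the start of the cache
discard = (limit - keep) // 2
print(discard)  # 2046, matching discard=2046 in the log
```

So every shift throws away half the window, the model refills it with the same repeated phrase, and the cycle never terminates.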

I looked at the streaming response, and it is just the same text repeating over and over without stopping.

Arnaqueur c'est pas le même prix afficher en plus moi il m'as vendu une cuisinière à 650 meme le recu je l'est toujours j'allais acheter la la cuisinière à 850 $ quand je lui est dis j'ai toujours refusé il m'as forcer de la garder une fois j'ai refusé il m'as envoyé le reste grave je vais vous renvoyer le le recu il m'as rien envoyé et j'ai découvert que cette dernière est endommagée une fois j'ai demandé de la faire retournée il m'as dit que que c'est de la faite retournée il m'as dit que que c'est de la faite retournée il m'as dit que que c'est de la faite retournée [... the clause "il m'as dit que que c'est de la faite retournée" repeats indefinitely ...]

I tried using the API with

    "options": {
       "num_predict": 512
    },

And while this cuts the response off at 512 tokens, it's still the same repeating text.
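Since I'm consuming the stream anyway, my current client-side workaround is to detect the loop and stop reading; a rough sketch (the window size and repeat threshold here are arbitrary values I picked, not anything Ollama provides):

```python
def is_looping(text: str, ngram_words: int = 8, min_repeats: int = 3) -> bool:
    """Crude runaway-loop signal: True if the last ngram_words-word
    phrase already occurs at least min_repeats times in the text."""
    words = text.split()
    if len(words) < ngram_words * min_repeats:
        return False  # too short to call it a loop yet
    tail = " ".join(words[-ngram_words:])
    return text.count(tail) >= min_repeats

# Usage idea: accumulate the streamed chunks into one string and
# close the HTTP stream as soon as is_looping(accumulated) is True.
```

This catches the "il m'as dit que que c'est de la faite retournée" case above, but it's obviously a band-aid rather than a fix.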

My request to ollama is in https://gist.github.com/alexmi256/343a35e4453a9500e82572ad86b04ab6

I've tried GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, OLLAMA_GPU_OVERHEAD=1536870912, and OLLAMA_CONTEXT_LENGTH=8192, but all give the same result; only OLLAMA_FLASH_ATTENTION or num_predict alleviates the issue.

  1. Any idea why this happens?
  2. Can someone else reproduce this issue with my prompt and Qwen2.5VL-7B?
  3. Any other options/configs to prevent this from happening and get a legible response?
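For question 3, the best I've come up with so far is combining num_predict with Ollama's repetition-penalty sampling options; a sketch of the request body I'd try against /api/generate (the option names are standard Ollama options, but the values are guesses on my part, and I haven't verified they break this particular loop):

```python
import json

# Hypothetical request body for POST /api/generate.
payload = {
    "model": "qwen2.5vl:7b",
    "prompt": "Describe this image.",
    "images": ["<base64-encoded image>"],  # placeholder, not real data
    "options": {
        "num_predict": 512,     # hard cap on generated tokens, as tried above
        "repeat_penalty": 1.2,  # penalize recently generated tokens
        "repeat_last_n": 256,   # window the penalty looks back over
        "temperature": 0.7,
    },
}
print(json.dumps(payload["options"]))
```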

Relevant log output

$ OLLAMA_DEBUG=1 OLLAMA_HOST=192.168.2.44:11434 OLLAMA_MODELS='/usr/share/ollama/.ollama/models' OLLAMA_ORIGINS='http://192.168.2.*' GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 OLLAMA_NUM_PARALLEL=1 ollama serve
time=2025-07-16T19:39:38.119-04:00 level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://192.168.2.44:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://192.168.2.* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-07-16T19:39:38.120-04:00 level=INFO source=images.go:476 msg="total blobs: 30"                                                                                            
time=2025-07-16T19:39:38.120-04:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"                                                                              
time=2025-07-16T19:39:38.120-04:00 level=INFO source=routes.go:1288 msg="Listening on 192.168.2.44:11434 (version 0.9.6)"                                                           
time=2025-07-16T19:39:38.121-04:00 level=DEBUG source=sched.go:108 msg="starting llm scheduler"                                                                                     
time=2025-07-16T19:39:38.121-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"                                                                                   
time=2025-07-16T19:39:38.122-04:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"                                                              
time=2025-07-16T19:39:38.122-04:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*                                                                   
time=2025-07-16T19:39:38.122-04:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /home/alex/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-07-16T19:39:38.125-04:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03]                              
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03                                                                                                                         
dlsym: cuInit - 0x7e8f8a974640                                                                                                                                                      
dlsym: cuDriverGetVersion - 0x7e8f8a974700                                                                                                                                          
dlsym: cuDeviceGetCount - 0x7e8f8a974880                                                                                                                                            
dlsym: cuDeviceGet - 0x7e8f8a9747c0                                                                                                                                                 
dlsym: cuDeviceGetAttribute - 0x7e8f8a974dc0                                                                                                                                        
dlsym: cuDeviceGetUuid - 0x7e8f8a974a00                                                                                                                                             
dlsym: cuDeviceGetName - 0x7e8f8a974940                                                                                                                                             
dlsym: cuCtxCreate_v3 - 0x7e8f8a975900                                                                                                                                              
dlsym: cuMemGetInfo_v2 - 0x7e8f8a9785a0                                                                                                                                             
dlsym: cuCtxDestroy - 0x7e8f8a9da4a0                                                                                                                                                
calling cuInit                                                                                                                                                                      
calling cuDriverGetVersion                                                                                                                                                          
raw version 0x2f3a                                                                                                                                                                  
CUDA driver version: 12.9                                                                                                                                                           
calling cuDeviceGetCount                                                                                                                                                            
device count 1                                                                                                                                                                      
time=2025-07-16T19:39:38.198-04:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03                                 
[GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372] CUDA totalMem 24118mb                                                                                                                    
[GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372] CUDA freeMem 23654mb                                                                                                                     
[GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372] Compute Capability 8.6                                                                                                                   
time=2025-07-16T19:39:38.352-04:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"                                                          
releasing cuda driver library
time=2025-07-16T19:39:38.352-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.1 GiB"
time=2025-07-16T19:39:43.377-04:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="62.7 GiB" before.free="60.2 GiB" before.free_swap="8.0 GiB" now.total="62.7 GiB" now.free="60.1 GiB" now.free_swap="8.0 GiB"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03
dlsym: cuInit - 0x7e8f8a974640
dlsym: cuDriverGetVersion - 0x7e8f8a974700
dlsym: cuDeviceGetCount - 0x7e8f8a974880
dlsym: cuDeviceGet - 0x7e8f8a9747c0
dlsym: cuDeviceGetAttribute - 0x7e8f8a974dc0
dlsym: cuDeviceGetUuid - 0x7e8f8a974a00
dlsym: cuDeviceGetName - 0x7e8f8a974940
dlsym: cuCtxCreate_v3 - 0x7e8f8a975900
dlsym: cuMemGetInfo_v2 - 0x7e8f8a9785a0
dlsym: cuCtxDestroy - 0x7e8f8a9da4a0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 1
time=2025-07-16T19:39:43.532-04:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.1 GiB" now.total="23.6 GiB" now.free="23.1 GiB" now.used="464.2 MiB"
releasing cuda driver library
time=2025-07-16T19:39:43.532-04:00 level=DEBUG source=sched.go:185 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-07-16T19:39:43.545-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-16T19:39:43.573-04:00 level=DEBUG source=sched.go:228 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025
time=2025-07-16T19:39:43.573-04:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[23.1 GiB]"
time=2025-07-16T19:39:43.573-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.image_size default=0
time=2025-07-16T19:39:43.574-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-07-16T19:39:43.574-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.key_length default=128
time=2025-07-16T19:39:43.574-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.value_length default=128
time=2025-07-16T19:39:43.575-04:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025 gpu=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 parallel=1 available=24803016704 required="8.0 GiB"
time=2025-07-16T19:39:43.575-04:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="62.7 GiB" before.free="60.1 GiB" before.free_swap="8.0 GiB" now.total="62.7 GiB" now.free="60.1 GiB" now.free_swap="8.0 GiB"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.575.64.03
dlsym: cuInit - 0x7e8f8a974640
dlsym: cuDriverGetVersion - 0x7e8f8a974700
dlsym: cuDeviceGetCount - 0x7e8f8a974880
dlsym: cuDeviceGet - 0x7e8f8a9747c0
dlsym: cuDeviceGetAttribute - 0x7e8f8a974dc0
dlsym: cuDeviceGetUuid - 0x7e8f8a974a00
dlsym: cuDeviceGetName - 0x7e8f8a974940
dlsym: cuCtxCreate_v3 - 0x7e8f8a975900
dlsym: cuMemGetInfo_v2 - 0x7e8f8a9785a0
dlsym: cuCtxDestroy - 0x7e8f8a9da4a0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 1
time=2025-07-16T19:39:43.721-04:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.1 GiB" now.total="23.6 GiB" now.free="23.1 GiB" now.used="464.2 MiB"
releasing cuda driver library
time=2025-07-16T19:39:43.721-04:00 level=INFO source=server.go:135 msg="system memory" total="62.7 GiB" free="60.1 GiB" free_swap="8.0 GiB"
time=2025-07-16T19:39:43.721-04:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[23.1 GiB]"
time=2025-07-16T19:39:43.721-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.image_size default=0
time=2025-07-16T19:39:43.721-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-07-16T19:39:43.722-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.key_length default=128
time=2025-07-16T19:39:43.722-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.value_length default=128
time=2025-07-16T19:39:43.722-04:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[23.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="8.0 GiB" memory.required.partial="8.0 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[8.0 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="261.3 MiB" memory.graph.partial="261.3 MiB" projector.weights="1.2 GiB" projector.graph="1.6 GiB"
time=2025-07-16T19:39:43.722-04:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.freq_scale default=1
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}"
time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025 --ctx-size 4096 --batch-size 512 --n-gpu-layers 29 --threads 12 --parallel 1 --port 40451"
time=2025-07-16T19:39:43.745-04:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_NUM_PARALLEL=1 OLLAMA_MODELS=/usr/share/ollama/.ollama/models OLLAMA_HOST=192.168.2.44:11434 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 OLLAMA_ORIGINS=http://192.168.2.* OLLAMA_DEBUG=1 PATH=/home/alex/.nvm/versions/node/v22.17.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/alex/.local/share/JetBrains/Toolbox/scripts:/home/alex/.local/share/JetBrains/Toolbox/scripts OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama CUDA_VISIBLE_DEVICES=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372
time=2025-07-16T19:39:43.745-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-16T19:39:43.754-04:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-07-16T19:39:43.754-04:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:40451"
time=2025-07-16T19:39:43.776-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.name default=""
time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.description default=""
time=2025-07-16T19:39:43.777-04:00 level=INFO source=ggml.go:92 msg="" architecture=qwen25vl file_type=Q4_K_M name="" description="" num_tensors=858 num_key_values=36
time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
time=2025-07-16T19:39:43.838-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:359 msg="offloading 28 repeating layers to GPU"
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:365 msg="offloading output layer to GPU"
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:375 msg="offloaded 29/29 layers to GPU"
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:377 msg="model weights" buffer=CPU size="292.4 MiB"
time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:377 msg="model weights" buffer=CUDA0 size="5.3 GiB"
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.freq_scale default=1
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}"
time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-07-16T19:39:43.996-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-07-16T19:39:44.188-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-07-16T19:39:44.189-04:00 level=DEBUG source=ggml.go:648 msg="compute graph" nodes=1748 splits=1
time=2025-07-16T19:39:44.189-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.7 GiB"
time=2025-07-16T19:39:44.189-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-07-16T19:39:44.222-04:00 level=DEBUG source=ggml.go:648 msg="compute graph" nodes=1073 splits=2
time=2025-07-16T19:39:44.222-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.7 GiB"
time=2025-07-16T19:39:44.222-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="16.8 MiB"
time=2025-07-16T19:39:44.228-04:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=306561024A allocated.CPU.Graph=17575936A allocated.CUDA0.UUID=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 allocated.CUDA0.Weights="[149112832A 149112832A 149112832A 131608576A 131608576A 149112832A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 149112832A 148639744A 148639744A 149112832A 1732615168A]" allocated.CUDA0.Cache="[8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 0U]" allocated.CUDA0.Graph=1781219584A
time=2025-07-16T19:39:44.247-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.02"
time=2025-07-16T19:39:44.497-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.26"
time=2025-07-16T19:39:44.748-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.47"
time=2025-07-16T19:39:44.998-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.67"
time=2025-07-16T19:39:45.249-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.86"
time=2025-07-16T19:39:45.499-04:00 level=INFO source=server.go:637 msg="llama runner started in 1.75 seconds"
time=2025-07-16T19:39:45.499-04:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen2.5vl:7b runner.inference=cuda runner.devices=1 runner.size="8.0 GiB" runner.vram="8.0 GiB" runner.parallel=1 runner.pid=157407 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025 runner.num_ctx=4096
time=2025-07-16T19:39:45.500-04:00 level=DEBUG source=server.go:736 msg="completion request" images=1 prompt=714 format=""
time=2025-07-16T19:39:45.502-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-07-16T19:39:45.700-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-07-16T19:39:45.700-04:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=1369 used=0 remaining=1369
time=2025-07-16T19:40:10.455-04:00 level=DEBUG source=cache.go:272 msg="context limit hit - shifting" id=0 limit=4096 input=4096 keep=4 discard=2046

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.9.6

source=server.go:291 msg="compatible gpu libraries" compatible=[] time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32 time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L} \\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128 time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.freq_scale default=1 time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}" time=2025-07-16T19:39:43.744-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520 time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/m odels/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f0994a83f8363df2c1e71c17605a025 --ctx-size 4096 --batch-size 512 --n-gpu-layers 29 --threads 12 --parallel 1 --port 40451" time=2025-07-16T19:39:43.745-04:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_NUM_PARALLEL=1 
OLLAMA_MODELS=/usr/share/ollama/.ollama/models OLLAMA_HOST=192.168.2.44:114 34 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 OLLAMA_ORIGINS=http://192.168.2.* OLLAMA_DEBUG=1 PATH=/home/alex/.nvm/versions/node/v22.17.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/ bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/alex/.local/share/JetBrains/Toolbox/scripts:/home/alex/.local/share/JetBrains/Toolbox/scripts OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama CUDA_VISIBLE_DEVICES=GPU-9e2cae97-cd7a-4f47-7eae-db2e2be81372 time=2025-07-16T19:39:43.745-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=1 time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding" time=2025-07-16T19:39:43.745-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding" time=2025-07-16T19:39:43.754-04:00 level=INFO source=runner.go:925 msg="starting ollama engine" time=2025-07-16T19:39:43.754-04:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:40451" time=2025-07-16T19:39:43.776-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32 time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.name default="" time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.description default="" time=2025-07-16T19:39:43.777-04:00 level=INFO source=ggml.go:92 msg="" architecture=qwen25vl file_type=Q4_K_M name="" description="" num_tensors=858 num_key_values=36 time=2025-07-16T19:39:43.777-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 
3090, compute capability 8.6, VMM: yes load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so time=2025-07-16T19:39:43.838-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFIL E=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:359 msg="offloading 28 repeating layers to GPU" time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:365 msg="offloading output layer to GPU" time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:375 msg="offloaded 29/29 layers to GPU" time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:377 msg="model weights" buffer=CPU size="292.4 MiB" time=2025-07-16T19:39:43.929-04:00 level=INFO source=ggml.go:377 msg="model weights" buffer=CUDA0 size="5.3 GiB" time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L} \\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128 time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type 
not found" key=qwen25vl.rope.freq_scale default=1 time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}" time=2025-07-16T19:39:43.929-04:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520 time=2025-07-16T19:39:43.996-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model" time=2025-07-16T19:39:44.188-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0] time=2025-07-16T19:39:44.189-04:00 level=DEBUG source=ggml.go:648 msg="compute graph" nodes=1748 splits=1 time=2025-07-16T19:39:44.189-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.7 GiB" time=2025-07-16T19:39:44.189-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="0 B" time=2025-07-16T19:39:44.222-04:00 level=DEBUG source=ggml.go:648 msg="compute graph" nodes=1073 splits=2 time=2025-07-16T19:39:44.222-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.7 GiB" time=2025-07-16T19:39:44.222-04:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="16.8 MiB" time=2025-07-16T19:39:44.228-04:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=306561024A allocated.CPU.Graph=17575936A allocated.CUDA0.UUID=GPU-9e2cae97-cd7 a-4f47-7eae-db2e2be81372 allocated.CUDA0.Weights="[149112832A 149112832A 149112832A 131608576A 131608576A 149112832A 131608576A 131135488A 148639744A 131608576A 131135488A 14863974 4A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 131608576A 131135488A 148639744A 149112832A 148639744A 148639744A 149112832A 1 732615168A]" allocated.CUDA0.Cache="[8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 
8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 8388608A 0U]" allocated.CUDA0.Graph=1781219584A time=2025-07-16T19:39:44.247-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.02" time=2025-07-16T19:39:44.497-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.26" time=2025-07-16T19:39:44.748-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.47" time=2025-07-16T19:39:44.998-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.67" time=2025-07-16T19:39:45.249-04:00 level=DEBUG source=server.go:643 msg="model load progress 0.86" time=2025-07-16T19:39:45.499-04:00 level=INFO source=server.go:637 msg="llama runner started in 1.75 seconds" time=2025-07-16T19:39:45.499-04:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen2.5vl:7b runner.inference=cuda runner.device s=1 runner.size="8.0 GiB" runner.vram="8.0 GiB" runner.parallel=1 runner.pid=157407 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-a99b7f834d754b88f122d865f32758ba9f099 4a83f8363df2c1e71c17605a025 runner.num_ctx=4096 time=2025-07-16T19:39:45.500-04:00 level=DEBUG source=server.go:736 msg="completion request" images=1 prompt=714 format="" time=2025-07-16T19:39:45.502-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0] time=2025-07-16T19:39:45.700-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0] time=2025-07-16T19:39:45.700-04:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=1369 used=0 remaining=1369 time=2025-07-16T19:40:10.455-04:00 level=DEBUG source=cache.go:272 msg="context limit hit - shifting" id=0 limit=4096 input=4096 keep=4 discard=2046 ``` ### OS Linux ### GPU Nvidia ### CPU _No response_ ### Ollama version 0.9.6
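The repeated "context limit hit - shifting" line is consistent with generation that never terminates: each time the 4096-token window fills, the runner evicts part of it and keeps generating. A rough sketch of the arithmetic implied by the log values (the "keep the first few tokens, drop half of the rest" rule is an assumption read off the numbers in the log, not taken from Ollama's source):

```python
# Values from the repeated log line:
#   "context limit hit - shifting" limit=4096 input=4096 keep=4 discard=2046
limit, keep = 4096, 4

# Assumed shift rule: keep the first `keep` tokens and discard half of the
# remainder to free room, so generation can continue indefinitely if the
# model never emits an end-of-sequence token.
discard = (limit - keep) // 2
print(discard)
```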
GiteaMirror added the bug label 2026-04-12 19:39:37 -05:00

@cloudtuotuo commented on GitHub (Jul 18, 2025):

Qwen2.5VL has always had this kind of issue: repeating output for some pictures.

<!-- gh-comment-id:3086409190 -->

@alexmi256 commented on GitHub (Jul 18, 2025):

Thanks for confirming this is not some odd one-off error.

I looked into it [more](https://github.com/QwenLM/Qwen2.5-VL/issues/241) and it looks like repetition is a somewhat common LLM problem, not only for this model.

I came across the [DRY](https://github.com/ggml-org/llama.cpp/pull/9702) [sampler](https://github.com/oobabooga/text-generation-webui/pull/5677) and tried using it in the API, but wasn't able to get any change in output using

```json
"options": {
  "dry_multiplier": 0.3,
  "dry_base": 1.7,
  "dry_allowed_length": 2
}
```

Probably user error in this case.
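For reference, a minimal sketch of a full request body carrying such options. The model name, prompt, and `repeat_penalty` value here are illustrative assumptions, and whether a given Ollama build actually recognizes the DRY keys is not guaranteed; unrecognized options may simply have no effect, which would also explain an unchanged output.

```python
import json

# Hypothetical body for a POST to Ollama's /api/generate endpoint.
# All field values are illustrative, not taken from the report above.
payload = {
    "model": "qwen2.5vl:7b",
    "prompt": "Describe this image.",
    "stream": False,
    "options": {
        "dry_multiplier": 0.3,   # DRY options may be ignored by builds
        "dry_base": 1.7,         # that don't implement the DRY sampler
        "dry_allowed_length": 2,
        "repeat_penalty": 1.15,  # classic repetition penalty as a fallback
    },
}

body = json.dumps(payload)
print(body)
```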

<!-- gh-comment-id:3089774887 --> @alexmi256 commented on GitHub (Jul 18, 2025): Thanks for confirming this is not some odd one off error. I looked into it [more](https://github.com/QwenLM/Qwen2.5-VL/issues/241) it looks like repetitions are a somewhat common LLM problem not only for that model. I came across [DRY](https://github.com/ggml-org/llama.cpp/pull/9702) [sampler](https://github.com/oobabooga/text-generation-webui/pull/5677) and tried using it in the API but wasn't able to get any change in output using ```json "options": { "dry_multiplier": 0.3 "dry_base": 1.7 "dry_allowed_length": 2 } ``` Probably user error in this case.
Reference: github-starred/ollama#7562