[GH-ISSUE #14312] qwen2.5vl:7b uses significantly more RAM than qwen3-vl:8b despite smaller architecture and identical quantization #9313

Open
opened 2026-04-12 22:10:36 -05:00 by GiteaMirror · 1 comment

Originally created by @brendensoares on GitHub (Feb 18, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14312

What is the issue?

What happened?

When running both models locally with Ollama using the same quantization (Q4_K_M) and comparable context sizes, qwen2.5vl:7b consistently consumes significantly more total memory than qwen3-vl:8b.

Observed behavior:

  • qwen2.5vl:7b
    • 8,192 context → ~15 GB
    • 16,384 context → ~17 GB
  • qwen3-vl:8b
    • 32,768 context → ~11 GB

This occurs even though:

  • qwen2.5vl:7b has ~8.3B parameters
  • qwen3-vl:8b has ~8.8B parameters
  • Both use Q4_K_M quantization

Additionally, doubling context for qwen2.5vl:7b (8k → 16k) increases memory by ~2 GB, which is significantly larger than what would be expected from KV cache growth alone based on published architecture parameters.

From the model configs:

Qwen2.5-VL-7B (text tower)

  • 28 layers
  • 4 KV heads
  • head dim = 128

Expected KV cache size at fp16 (2 × layers × KV heads × head dim × context length × 2 bytes):

  • ~0.44 GiB @ 8k
  • ~0.88 GiB @ 16k

However, the observed memory increase (~2 GB when doubling context from 8k to 16k) is roughly 4–5× the ~0.44 GB of additional KV cache that the theoretical size predicts.

This suggests one or more of the following:

  • the KV cache is stored at higher precision (e.g., fp32),
  • the KV cache is duplicated or staged,
  • there is unexpected additional runtime allocation for this model.
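
The expected KV cache figures and the ~4–5× multiplier above can be reproduced with a short calculation. This is a minimal sketch, not Ollama's actual memory estimator: the function name `kv_cache_bytes` is hypothetical, the formula assumes a plain dense K and V cache with no padding or alignment, and it compares GiB against the GB values shown by `ollama ps` without correcting for the unit difference.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes needed for the K and V caches across all layers.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_elem=2 corresponds to fp16, 4 to fp32.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Qwen2.5-VL-7B text tower values quoted in this issue.
kv_8k = kv_cache_bytes(28, 4, 128, 8192)
kv_16k = kv_cache_bytes(28, 4, 128, 16384)

expected_growth_gib = (kv_16k - kv_8k) / GIB
observed_growth_gb = 17 - 15  # from the `ollama ps` sizes reported above
ratio = observed_growth_gb / expected_growth_gib

# Same growth if the cache were held at fp32 instead of fp16.
fp32_growth_gib = (
    kv_cache_bytes(28, 4, 128, 16384, bytes_per_elem=4)
    - kv_cache_bytes(28, 4, 128, 8192, bytes_per_elem=4)
) / GIB

print(f"fp16 KV @ 8k:  {kv_8k / GIB:.4f} GiB")
print(f"fp16 KV @ 16k: {kv_16k / GIB:.4f} GiB")
print(f"observed / expected growth: {ratio:.2f}x")
print(f"fp32 growth 8k -> 16k: {fp32_growth_gib:.4f} GiB")
```

Under these assumptions, an fp32 cache would only double the 8k → 16k growth, so higher precision alone would not account for the observed ~2 GB increase.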

What did you expect to happen?

Given:

  • qwen3-vl:8b has more layers (36 vs 28),
  • more KV heads (8 vs 4),
  • larger hidden size (4096 vs 3584),

The theoretical KV cache size for qwen3-vl:8b should be significantly larger than for qwen2.5vl:7b at the same context length.

Therefore, total runtime memory usage for qwen3-vl:8b should be equal to or greater than qwen2.5vl:7b, not substantially smaller.

The observed behavior contradicts expectations based on architecture-derived KV cache size and parameter count.
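
To make the expectation concrete, the same fp16 formula can be applied to both models side by side. This is a rough sketch under stated assumptions: the layer and KV head counts come from this issue, but `head_dim=128` for `qwen3-vl:8b` is an assumption (the issue only gives its hidden size), and the helper `kv_cache_gib` is a hypothetical name, not part of Ollama.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Theoretical K+V cache size in GiB (fp16 by default)."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1024 ** 3


models = {
    "qwen2.5vl:7b": {"n_layers": 28, "n_kv_heads": 4, "head_dim": 128},
    # head_dim for qwen3-vl:8b is ASSUMED to be 128; it is not stated in the issue.
    "qwen3-vl:8b": {"n_layers": 36, "n_kv_heads": 8, "head_dim": 128},
}

for ctx in (8192, 16384, 32768):
    for name, cfg in models.items():
        print(f"{name:14s} ctx={ctx:6d}  KV ~ {kv_cache_gib(context_len=ctx, **cfg):.3f} GiB")

# Ratio of theoretical KV size at equal context length (independent of context).
kv_ratio = kv_cache_gib(context_len=8192, **models["qwen3-vl:8b"]) / kv_cache_gib(
    context_len=8192, **models["qwen2.5vl:7b"]
)
print(f"qwen3-vl:8b / qwen2.5vl:7b KV ratio: {kv_ratio:.2f}x")
```

If the assumed head dimension holds, `qwen3-vl:8b` needs about 2.6× the KV cache of `qwen2.5vl:7b` at any given context length, which is why its lower total memory use is surprising.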

Related issues that appear consistent with unexpected memory estimation or allocation behavior in Qwen 2.5 models:

  • https://github.com/ollama/ollama/issues/10163
  • https://github.com/ollama/ollama/issues/13687

Relevant log output


ollama ps output:


qwen2.5vl:7b    5ced39dfa4ba    15 GB    100% GPU           8192     4 minutes from now
qwen2.5vl:7b    5ced39dfa4ba    17 GB    12%/88% CPU/GPU    16384    4 minutes from now
qwen3-vl:8b     901cae732162    11 GB    100% GPU           32768    4 minutes from now


Model metadata:


ollama show qwen3-vl:8b
  parameters          8.8B
  embedding length    4096
  quantization        Q4_K_M

ollama show qwen2.5vl:7b
  parameters          8.3B
  embedding length    3584
  quantization        Q4_K_M

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.16.2

GiteaMirror added the bug label 2026-04-12 22:10:36 -05:00

@somera commented on GitHub (Feb 21, 2026):

Same problem after upgrading to Ollama v0.16.3:

Feb 21 11:11:08 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:08.524+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:12[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:12(16..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:09 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:09.728+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:11[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:11(17..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:10 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:10.938+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:10[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:10(18..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:12 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:12.133+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:9[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:9(19..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:13 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:13.341+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:8[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:8(20..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:14 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:14.541+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:7[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:7(21..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:15 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:15.744+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:6[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:6(22..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:16 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:16.940+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:5[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:5(23..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:18 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:18.151+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:4[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:4(24..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:19 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:19.355+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:3[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:3(25..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:20 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:20.570+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:2[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:2(26..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:11:21 AI-DEV-VM-Neptun ollama[805]: time=2026-02-21T11:11:21.769+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:1[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:1(27..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

...

Feb 21 11:40:41 AI-DEV-VM-Neptun ollama[4269]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 37058.21 MiB on device 0: cudaMalloc failed: out of memory
Feb 21 11:40:41 AI-DEV-VM-Neptun ollama[4269]: ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 38858347648
Feb 21 11:40:42 AI-DEV-VM-Neptun ollama[4269]: time=2026-02-21T11:40:42.039+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:27[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:27(1..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:40:45 AI-DEV-VM-Neptun ollama[4269]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 37272.58 MiB on device 0: cudaMalloc failed: out of memory
Feb 21 11:40:45 AI-DEV-VM-Neptun ollama[4269]: ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 39083137024
Feb 21 11:40:45 AI-DEV-VM-Neptun ollama[4269]: time=2026-02-21T11:40:45.612+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:26[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:26(2..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 11:40:49 AI-DEV-VM-Neptun ollama[4269]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 37361.27 MiB on device 0: cudaMalloc failed: out of memory
Feb 21 11:40:49 AI-DEV-VM-Neptun ollama[4269]: ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 39176134656
Feb 21 11:40:49 AI-DEV-VM-Neptun ollama[4269]: time=2026-02-21T11:40:49.365+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 GPULayers:25[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:25(3..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
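
For anyone comparing runs, the requested context size, offloaded layer count, and failed allocation sizes can be pulled out of logs like the ones above with a small script. This is only a sketch: the regular expressions are inferred from the log lines in this comment rather than from any documented Ollama log format, and `summarize` is a hypothetical helper name.

```python
import re

# Patterns inferred from the log lines shown in this thread (not a documented format).
LOAD_RE = re.compile(
    r"Operation:(\w+).*?FlashAttention:(\w+).*?KvSize:(\d+).*?GPULayers:(\d+)\["
)
OOM_RE = re.compile(r"allocating ([\d.]+) MiB on device (\d+): cudaMalloc failed")


def summarize(lines):
    """Return a list of ("load", ...) and ("oom", ...) tuples found in the lines."""
    events = []
    for line in lines:
        m = LOAD_RE.search(line)
        if m:
            op, flash, kv_size, gpu_layers = m.groups()
            events.append(("load", op, flash, int(kv_size), int(gpu_layers)))
            continue
        m = OOM_RE.search(line)
        if m:
            # Convert the reported MiB to GiB for easier comparison.
            events.append(("oom", float(m.group(1)) / 1024, int(m.group(2))))
    return events


# Two lines copied verbatim from the log excerpt above.
sample = [
    'Feb 21 11:40:41 AI-DEV-VM-Neptun ollama[4269]: '
    'ggml_backend_cuda_buffer_type_alloc_buffer: allocating 37058.21 MiB '
    'on device 0: cudaMalloc failed: out of memory',
    'Feb 21 11:40:42 AI-DEV-VM-Neptun ollama[4269]: '
    'time=2026-02-21T11:40:42.039+01:00 level=INFO source=runner.go:1284 '
    'msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 '
    'FlashAttention:Disabled KvSize:128000 KvCacheType: NumThreads:12 '
    'GPULayers:27[ID:GPU-bf11243f-f6da-11f0-94c9-4cc7a902f6ae Layers:27(1..27)] '
    'MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"',
]

events = summarize(sample)
for event in events:
    print(event)
```

On the excerpt above, this shows single allocation attempts of roughly 36 GiB with `KvSize:128000` and `FlashAttention:Disabled`, which may be worth including when reporting similar failures.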
Reference: github-starred/ollama#9313