Closed · 32 comments
Originally created by @blueApple12 on GitHub (Jan 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8310
What is the issue?
I bought a new PC with a 4070 Super to do some AI tasks using Ollama, but when I try to run llama3.2-vision it doesn't utilize my GPU, only my CPU. llama3.2 does utilize my GPU, so why is that? Thank you.
OS
Windows
GPU
Nvidia
CPU
AMD
Ollama version
0.5.4
@rick-github commented on GitHub (Jan 5, 2025):
Maybe not enough free VRAM on your system, depending on what else you are running. The output of nvidia-smi and the server logs will aid in identifying the cause.
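(As a concrete way to run that check: Ollama's /api/ps endpoint, which `ollama ps` wraps, reports how much of each loaded model is resident in VRAM. A minimal sketch, assuming the default host and port; the helper name and output format are illustrative, while the endpoint and its size/size_vram fields are from the Ollama API docs.)

# Query a local Ollama server for its loaded models and how much of each
# sits in VRAM. GET /api/ps and its size/size_vram fields are documented
# in the Ollama API; a fully offloaded model reports close to 100% in VRAM.
import json
from urllib.request import urlopen

def report_loaded_models(host: str = "http://127.0.0.1:11434") -> None:
    with urlopen(f"{host}/api/ps") as resp:
        data = json.load(resp)
    for m in data.get("models", []):
        total = m.get("size", 0)
        vram = m.get("size_vram", 0)
        pct = 100 * vram / total if total else 0
        print(f"{m['name']}: {vram / 2**30:.2f} GiB of "
              f"{total / 2**30:.2f} GiB in VRAM ({pct:.0f}%)")

if __name__ == "__main__":
    report_loaded_models()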
@blueApple12 commented on GitHub (Jan 5, 2025):
This is my nvidia-smi output:
Sun Jan 5 18:15:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36 Driver Version: 566.36 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... WDDM | 00000000:01:00.0 On | N/A |
| 0% 42C P0 33W / 220W | 1350MiB / 12282MiB | 14% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
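(Free VRAM can also be polled programmatically rather than read off the table above, which helps when checking memory right as a model loads. A small sketch using nvidia-smi's documented --query-gpu CSV interface; error handling is omitted.)

# Poll free/total GPU memory via nvidia-smi's machine-readable query mode.
# --query-gpu=memory.free,memory.total with --format=csv,noheader,nounits
# prints one "free, total" pair per GPU, in MiB.
import subprocess

def gpu_memory_mib() -> list[tuple[int, int]]:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.free,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [tuple(int(x) for x in line.split(","))
            for line in out.strip().splitlines()]

if __name__ == "__main__":
    for i, (free, total) in enumerate(gpu_memory_mib()):
        print(f"GPU {i}: {free} MiB free of {total} MiB")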
@rick-github commented on GitHub (Jan 5, 2025):
server logs will aid in identifying the cause.
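(On Windows, those server logs live under %LOCALAPPDATA%\Ollama by default, per the Ollama troubleshooting docs. A trivial sketch that prints the expected paths; the file names are the documented defaults.)

# Print the default Ollama log file paths on Windows. The
# %LOCALAPPDATA%\Ollama location and the server.log/app.log names follow
# the Ollama troubleshooting docs; adjust if your install differs.
import os

base = os.path.expandvars(r"%LOCALAPPDATA%\Ollama")
for name in ("server.log", "app.log"):
    print(os.path.join(base, name))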
@blueApple12 commented on GitHub (Jan 6, 2025):
This is my server log:
2025/01/05 16:43:45 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\avish\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-01-05T16:43:45.920+02:00 level=INFO source=images.go:757 msg="total blobs: 12"
time=2025-01-05T16:43:45.926+02:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2025-01-05T16:43:45.929+02:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2025-01-05T16:43:45.930+02:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx cpu]"
time=2025-01-05T16:43:45.931+02:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-05T16:43:45.932+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-01-05T16:43:45.932+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-01-05T16:43:46.096+02:00 level=INFO source=amd_hip_windows.go:103 msg="AMD ROCm reports no devices found"
time=2025-01-05T16:43:46.096+02:00 level=INFO source=amd_windows.go:50 msg="no compatible amdgpu devices detected"
time=2025-01-05T16:43:46.099+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/01/05 - 16:43:59 | 200 | 500µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:43:59 | 200 | 2.5007ms | 127.0.0.1 | GET "/api/tags"
time=2025-01-05T16:44:53.262+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T16:44:53.331+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.4 GiB" free_swap="10.1 GiB"
time=2025-01-05T16:44:53.336+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[3.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T16:44:53.346+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59288"
time=2025-01-05T16:44:53.352+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:44:53.353+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:44:53.353+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:44:53.387+02:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2025-01-05T16:44:53.404+02:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:44:53.406+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59288"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
time=2025-01-05T16:44:53.607+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: CPU model buffer size = 5679.33 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T16:45:03.649+02:00 level=INFO source=server.go:594 msg="llama runner started in 10.30 seconds"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:45:09 | 200 | 16.5585361s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:45:09.804+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:45:19 | 200 | 9.9060809s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:45:36.380+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:45:41 | 200 | 5.0508025s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:45:41.512+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:45:45 | 500 | 4.0971502s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:47:36.667+02:00 level=INFO source=runner.go:662 msg="aborting completion request due to client closing the connection"
time=2025-01-05T16:47:38.948+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:47:43 | 200 | 4.9453625s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:47:43.887+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:47:52 | 200 | 8.9424866s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:08.430+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:48:14 | 200 | 5.8496372s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:14.287+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:48:27 | 200 | 13.5327677s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:45.398+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:48:51 | 200 | 5.8480718s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:51.241+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:49:08 | 200 | 17.0670151s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:49:41.721+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:49:47 | 200 | 5.3733708s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:49:47.151+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:50:14 | 500 | 27.8632648s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:50:22 | 200 | 997.4µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:50:22 | 200 | 26.5029ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:50:22.893+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:50:22 | 200 | 24.5118ms | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T16:50:24.739+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:50:40 | 200 | 15.6020101s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:51:35 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:51:35 | 200 | 62.5791ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:51:35.660+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="10.5 GiB"
time=2025-01-05T16:51:35.661+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=4 available=11251539968 required="3.7 GiB"
time=2025-01-05T16:51:35.683+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.8 GiB" free_swap="19.7 GiB"
time=2025-01-05T16:51:35.683+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T16:51:35.688+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 4 --port 59523"
time=2025-01-05T16:51:35.695+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=2
time=2025-01-05T16:51:35.695+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:51:35.695+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:51:36.419+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:51:36.460+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:51:36.463+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59523"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
time=2025-01-05T16:51:36.702+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 424.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T16:51:38.959+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.26 seconds"
[GIN] 2025/01/05 - 16:51:38 | 200 | 3.364194s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/01/05 - 16:51:51 | 200 | 676.757ms | 127.0.0.1 | POST "/api/chat"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:51:57 | 200 | 1.1356145s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:52:11 | 200 | 5.1867467s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:52:20 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:52:20 | 200 | 15.4987ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:52:20.927+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T16:52:25.947+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0186925 model=C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-01-05T16:52:26.009+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="7.1 GiB"
time=2025-01-05T16:52:26.196+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2687383 model=C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-01-05T16:52:26.356+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.8 GiB" free_swap="19.7 GiB"
time=2025-01-05T16:52:26.358+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=33 layers.split="" memory.available="[10.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.3 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.3 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T16:52:26.363+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 33 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59564"
time=2025-01-05T16:52:26.368+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:52:26.368+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:52:26.368+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:52:26.470+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:52:26.509+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:52:26.510+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59564"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
time=2025-01-05T16:52:26.620+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 33 repeating layers to GPU
llm_load_tensors: offloaded 33/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1306.52 MiB
llm_load_tensors: CUDA0 model buffer size = 4090.98 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 48.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 558.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 71 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T16:52:35.902+02:00 level=INFO source=server.go:594 msg="llama runner started in 9.53 seconds"
[GIN] 2025/01/05 - 16:52:35 | 200 | 14.9908729s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T16:52:44.743+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:52:54 | 200 | 10.2073533s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:53:23.080+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:53:35 | 200 | 12.800509s | 127.0.0.1 | POST "/api/chat"
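
The `vocab_only = 1` dump that precedes this request is not a second full load: `vocab only - skipping tensors` means the server opened a tokenizer-only copy of the GGUF (as far as I can tell, to count and truncate prompt tokens). That accounting surfaces in the final API response; a sketch with a placeholder tag:

```python
import requests

# The non-streaming /api/chat response reports the token accounting the
# tokenizer-only load makes possible. Placeholder model tag.
r = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "llama3.2-vision",
        "messages": [{"role": "user", "content": "hi"}],
        "stream": False,
    },
).json()
print(r["prompt_eval_count"], r["eval_count"])   # tokens in / tokens out
```
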
time=2025-01-05T16:55:02.546+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="644.1 MiB"
time=2025-01-05T16:55:02.868+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=4 available=11424337920 required="3.7 GiB"
time=2025-01-05T16:55:02.891+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.2 GiB" free_swap="19.5 GiB"
time=2025-01-05T16:55:02.891+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T16:55:02.896+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 4 --port 59620"
time=2025-01-05T16:55:02.900+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:55:02.900+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:55:02.901+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:55:02.994+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:55:03.031+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:55:03.032+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59620"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
time=2025-01-05T16:55:03.152+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 424.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T16:55:03.904+02:00 level=INFO source=server.go:594 msg="llama runner started in 1.00 seconds"
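
For what it's worth, the numbers in this load are internally consistent; a quick check with values copied straight from the log:

```python
# All inputs copied from the llama 3.2 3B load above.
n_layer       = 28         # llama.block_count
n_head_kv     = 8
n_embd_head_k = 128
n_ctx         = 8192       # 4 parallel slots x 2048 tokens each
BYTES_F16     = 2

n_embd_k_gqa = n_head_kv * n_embd_head_k              # 1024, as printed
k_bytes = n_ctx * n_embd_k_gqa * BYTES_F16 * n_layer
print(k_bytes / 2**20)                                # 448.0 -> "K (f16): 448.00 MiB"

# Bits per weight from "model size = 1.87 GiB (5.01 BPW)" and 3.21 B params:
print(1.87 * 2**30 * 8 / 3.21e9)                      # ~5.0
```
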
[GIN] 2025/01/05 - 16:55:07 | 200 | 4.7508347s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:55:38 | 200 | 15.5506907s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:56:03 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:56:03 | 404 | 497.8µs | 127.0.0.1 | POST "/api/show"
[GIN] 2025/01/05 - 16:56:04 | 200 | 1.0677276s | 127.0.0.1 | POST "/api/pull"
[GIN] 2025/01/05 - 16:56:12 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:56:12 | 200 | 16.0004ms | 127.0.0.1 | POST "/api/show"
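
The 404 from `/api/show` followed by `/api/pull` above is the usual ensure-model-exists pattern; a sketch of the same flow (default port, placeholder tag):

```python
import requests

BASE = "http://127.0.0.1:11434"
model = "llama3.2-vision"   # placeholder tag

# /api/show returns 404 for a model that isn't local, so pull it and retry,
# mirroring the request sequence in the log above.
if requests.post(f"{BASE}/api/show", json={"model": model}).status_code == 404:
    requests.post(f"{BASE}/api/pull", json={"model": model, "stream": False})
print(requests.post(f"{BASE}/api/show", json={"model": model}).status_code)
```
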
time=2025-01-05T16:56:12.092+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T16:56:12.138+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="7.2 GiB"
time=2025-01-05T16:56:12.485+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.4 GiB" free_swap="19.7 GiB"
time=2025-01-05T16:56:12.488+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=34 layers.split="" memory.available="[10.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.5 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.5 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
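
Rough reconstruction of how that line becomes `layers.offload=34` (back-of-envelope arithmetic over the printed values only, not ollama's actual memory.go logic):

```python
# Back-of-envelope only -- NOT ollama's real estimator, just the printed numbers.
available = 10.6                                    # GiB free on the GPU
projector = 1.8 + 2.8                               # projector.weights + projector.graph
fixed     = (411.0 + 656.2 + 669.5) / 1024          # nonrepeating weights + KV + partial graph, GiB
per_layer = 5.1 / 40                                # repeating weights spread over 40 layers
print((available - projector - fixed) / per_layer)  # ~33.8, close to layers.offload=34
```
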
time=2025-01-05T16:56:12.492+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 34 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59649"
time=2025-01-05T16:56:12.497+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:56:12.497+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:56:12.497+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:56:12.583+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:56:12.618+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:56:12.619+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59649"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-01-05T16:56:12.748+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 34 repeating layers to GPU
llm_load_tensors: offloaded 34/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1189.49 MiB
llm_load_tensors: CUDA0 model buffer size = 4208.01 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 40.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 566.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 60 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T16:56:15.507+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.01 seconds"
[GIN] 2025/01/05 - 16:56:15 | 200 | 3.4331242s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T16:56:16.504+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:56:17 | 200 | 1.1631321s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:56:25.494+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:56:32 | 200 | 7.4890539s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:57:25 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:57:25 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
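
`GET /api/ps` is the quickest way to see how much of each resident model actually sits in VRAM versus host memory, which is the interesting number whenever `offloaded N/41 layers` shows a partial split:

```python
import requests

# List resident models with their total size and the VRAM-resident portion.
for m in requests.get("http://127.0.0.1:11434/api/ps").json()["models"]:
    print(m["name"], m["size"], m.get("size_vram"))
```
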
[GIN] 2025/01/05 - 17:17:12 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:17:12 | 200 | 16.5005ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:17:12.973+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T17:17:13.029+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.5 GiB" free_swap="19.4 GiB"
time=2025-01-05T17:17:13.033+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=32 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.2 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.2 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T17:17:13.038+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 32 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59968"
time=2025-01-05T17:17:13.043+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:17:13.043+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:17:13.043+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:17:13.146+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:17:13.182+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:17:13.183+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59968"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
time=2025-01-05T17:17:13.295+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1439.02 MiB
llm_load_tensors: CUDA0 model buffer size = 3958.48 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 56.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 550.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 82 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T17:17:16.304+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.26 seconds"
[GIN] 2025/01/05 - 17:17:16 | 200 | 3.3481249s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T17:17:18.345+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:17:19 | 200 | 690.9205ms | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T17:17:31.745+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 17:17:37 | 200 | 5.7424355s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:17:49 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:17:49 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-01-05T17:32:26.860+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=1 available=3969449984 required="2.9 GiB"
time=2025-01-05T17:32:26.881+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.6 GiB" free_swap="9.6 GiB"
time=2025-01-05T17:32:26.882+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[3.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.9 GiB" memory.required.partial="2.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[2.9 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T17:32:26.887+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 1 --port 60186"
time=2025-01-05T17:32:26.892+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:32:26.892+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:32:26.892+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:32:27.004+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:32:27.040+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:32:27.041+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:60186"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
time=2025-01-05T17:32:27.144+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 224.00 MiB
llama_new_context_with_model: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 256.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 10.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T17:32:28.148+02:00 level=INFO source=server.go:594 msg="llama runner started in 1.26 seconds"
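
Compared with the earlier 3B load, the scheduler now chose `parallel=1` and `--ctx-size 2048`, which is why the KV cache shrank from 896 MiB to 224 MiB. If you'd rather pin this behavior than let it float with free VRAM, the documented environment variables can be set before `ollama serve`; a sketch:

```python
import os
import subprocess

# Sketch: pin scheduler behavior with documented environment variables.
env = dict(os.environ)
env["OLLAMA_NUM_PARALLEL"] = "1"        # one slot per model -> smaller KV cache
env["OLLAMA_MAX_LOADED_MODELS"] = "1"   # evict the old model instead of sharing VRAM
subprocess.run(["ollama", "serve"], env=env)
```
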
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 17:32:28 | 200 | 1.9139322s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T17:32:28.741+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T17:32:28.779+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="1.1 GiB"
time=2025-01-05T17:32:29.126+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.6 GiB" free_swap="9.6 GiB"
time=2025-01-05T17:32:29.130+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[3.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T17:32:29.131+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 60190"
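
Note the runner path above: with the 3B model still resident, only ~1.1 GiB of VRAM was left, `layers.offload=0`, and the vision model fell back to the `cpu_avx2` runner entirely. One way to avoid that thrash when alternating models is to unload the old one explicitly with the documented `keep_alive` parameter before requesting the next (placeholder tags):

```python
import requests

BASE = "http://127.0.0.1:11434"

# An empty-prompt request with keep_alive=0 unloads the resident model,
# freeing its VRAM before the next (larger) model is scheduled.
requests.post(f"{BASE}/api/generate",
              json={"model": "llama3.2", "keep_alive": 0})
requests.post(f"{BASE}/api/generate",
              json={"model": "llama3.2-vision", "prompt": "hi", "stream": False})
```
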
time=2025-01-05T17:32:29.138+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:32:29.138+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:32:29.138+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:32:29.154+02:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2025-01-05T17:32:29.171+02:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:32:29.172+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:60190"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
time=2025-01-05T17:32:29.389+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: CPU model buffer size = 5679.33 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T17:32:38.426+02:00 level=INFO source=server.go:594 msg="llama runner started in 9.29 seconds"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 17:32:50 | 200 | 22.0606032s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:44:48 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:44:48 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:44:58 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:44:58 | 200 | 16.9993ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:44:58.564+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=4 available=11475161088 required="3.7 GiB"
time=2025-01-05T17:44:58.585+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.3 GiB" free_swap="17.9 GiB"
time=2025-01-05T17:44:58.585+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T17:44:58.590+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 4 --port 63944"
time=2025-01-05T17:44:58.599+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:44:58.599+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:44:58.599+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:44:58.726+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:44:58.763+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:44:58.764+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63944"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
time=2025-01-05T17:44:58.850+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 424.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T17:45:00.104+02:00 level=INFO source=server.go:594 msg="llama runner started in 1.51 seconds"
[GIN] 2025/01/05 - 17:45:00 | 200 | 1.5937332s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/01/05 - 17:45:04 | 200 | 648.2546ms | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:45:11 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:11 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:45:20 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:20 | 200 | 16.4985ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:45:20.728+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T17:45:20.764+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="7.3 GiB"
time=2025-01-05T17:45:21.110+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.4 GiB" free_swap="18.0 GiB"
time=2025-01-05T17:45:21.112+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=35 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.6 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.6 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T17:45:21.118+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 35 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 63956"
time=2025-01-05T17:45:21.123+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:45:21.123+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:45:21.123+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:45:21.205+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:45:21.240+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:45:21.241+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63956"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-01-05T17:45:21.375+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 35 repeating layers to GPU
llm_load_tensors: offloaded 35/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1072.46 MiB
llm_load_tensors: CUDA0 model buffer size = 4325.04 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 32.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 574.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 49 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T17:45:24.635+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.51 seconds"
[GIN] 2025/01/05 - 17:45:24 | 200 | 3.922538s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T17:45:27.666+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:45:28 | 200 | 1.3313811s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:45:31 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:31 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:45:43 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:43 | 200 | 20.9989ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:45:43.977+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:45:43 | 200 | 16.001ms | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T17:45:56.150+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:46:33 | 200 | 37.3366449s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:46:36 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:46:36 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:49:35 | 200 | 500.9µs | 127.0.0.1 | GET "/api/version"
[GIN] 2025/01/06 - 17:54:18 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/06 - 17:54:18 | 200 | 16.4978ms | 127.0.0.1 | POST "/api/show"
time=2025-01-06T17:54:18.334+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-06T17:54:18.418+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.8 GiB" free_swap="18.0 GiB"
time=2025-01-06T17:54:18.421+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=36 layers.split="" memory.available="[10.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.8 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.8 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-06T17:54:18.426+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 36 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 57521"
time=2025-01-06T17:54:18.436+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-06T17:54:18.436+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-06T17:54:18.437+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-06T17:54:18.577+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-06T17:54:18.622+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-06T17:54:18.623+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:57521"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
time=2025-01-06T17:54:18.688+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 36 repeating layers to GPU
llm_load_tensors: offloaded 36/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 939.96 MiB
llm_load_tensors: CUDA0 model buffer size = 4457.54 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 24.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 582.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 38 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-06T17:54:22.199+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.76 seconds"
[GIN] 2025/01/06 - 17:54:22 | 200 | 3.8828603s | 127.0.0.1 | POST "/api/generate"
time=2025-01-06T17:54:26.611+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/06 - 17:54:27 | 200 | 1.042562s | 127.0.0.1 | POST "/api/chat"
time=2025-01-06T17:54:36.996+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/06 - 17:55:04 | 200 | 27.6056929s | 127.0.0.1 | POST "/api/chat"
@ChandlerHooley commented on GitHub (Jan 7, 2025):
Having the same issue as well. Latest version of Ollama and an NVIDIA GTX 1650 SUPER graphics card. (Yes, I know it isn't powerful; this is just for a POC.) Here are my logs from running `ollama serve` and then, in another window, `ollama run llama3.2-vision`. If I can provide any other information that would help, please let me know.
2025/01/06 22:07:11 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\chand\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-01-06T22:07:11.022-06:00 level=INFO source=images.go:757 msg="total blobs: 11"
time=2025-01-06T22:07:11.023-06:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-01-06T22:07:11.024-06:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-01-06T22:07:11.158-06:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-dd59afa0-5797-0eb5-41fa-a4e67e77623f library=cuda variant=v12 compute=7.5 driver=12.6 name="NVIDIA GeForce GTX 1650 SUPER" total="4.0 GiB" available="3.2 GiB"
[GIN] 2025/01/06 - 22:07:22 | 200 | 544.5µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/06 - 22:07:22 | 200 | 54.7107ms | 127.0.0.1 | POST "/api/show"
time=2025-01-06T22:07:22.747-06:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-06T22:07:22.793-06:00 level=INFO source=server.go:104 msg="system memory" total="63.7 GiB" free="42.7 GiB" free_swap="45.0 GiB"
time=2025-01-06T22:07:22.796-06:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[2.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-06T22:07:22.802-06:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\chand\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe runner --model C:\Users\chand\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --mmproj C:\Users\chand\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 58622"
time=2025-01-06T22:07:22.960-06:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-06T22:07:22.960-06:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-06T22:07:22.962-06:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-06T22:07:22.967-06:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2025-01-06T22:07:22.969-06:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2025-01-06T22:07:22.970-06:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:58622"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\chand\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
time=2025-01-06T22:07:23.213-06:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: CPU model buffer size = 5679.33 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-06T22:07:37.235-06:00 level=INFO source=server.go:594 msg="llama runner started in 14.27 seconds"
[GIN] 2025/01/06 - 22:07:37 | 200 | 14.5041464s | 127.0.0.1 | POST "/api/generate"
@rick-github commented on GitHub (Jan 7, 2025):
When ollama started, there was 10.8G of free VRAM. When it came time to load a model, something else was running and only 3.5G was free. The llama3.2-vision model won't fit in that, so ollama loads it into RAM.
The model is unloaded after 5 minutes (the keep-alive default), and a bit later another request comes in for it. This time there is 10.5G available and ollama does a partial load (33 of 41 layers) into the GPU.
Your GPU is too small to host the entire model, and other GPU users are occasionally taking VRAM to the point where ollama can't even do a partial load.
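As a quick check, `ollama ps` (standard CLI, nothing assumed beyond a running server) shows whether the loaded model ended up on the CPU, the GPU, or split between the two:
```
# PROCESSOR shows "100% GPU" for a full offload, "100% CPU" for a RAM-only
# load, or a split such as "25%/75% CPU/GPU" for a partial load.
ollama ps
```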
@blueApple12 commented on GitHub (Jan 7, 2025):
So I don't have enough VRAM?
@rick-github commented on GitHub (Jan 7, 2025):
Correct.
@blueApple12 commented on GitHub (Jan 7, 2025):
Is there a way to get around this?
@rick-github commented on GitHub (Jan 7, 2025):
Stop other applications from using the GPU.
https://www.google.com/search?q=windows+switch+default+gpu+to+integrated
https://www.google.com/search?q=windows+restrict+process+from+using+gpu
@blueApple12 commented on GitHub (Jan 7, 2025):
Is there any other way to use less VRAM, like a low-VRAM mode?
@rick-github commented on GitHub (Jan 7, 2025):
There are two components that take up VRAM: context and weights. The usual ways of reducing context size (`num_ctx`, `OLLAMA_NUM_PARALLEL`, `OLLAMA_FLASH_ATTENTION`) won't help because you are already using the minimum context. Other models (e.g. llama3.2:3b) come in a variety of quantizations, which can be used to reduce the size of the weights. The default quant for llama3.2:3b is q4_K_M, which is 2G, but the size can be as low as 1.4G with the q2_K quant. Unfortunately, llama3.2-vision doesn't offer anything smaller than q4_K_M at 7.9G. I haven't tried this, but in theory you could take the base model and quantize it yourself to something smaller. However, I don't think the tool I usually use for quantizing models (llama.cpp) supports the llama3.2-vision architecture (mllama), so you'd need to find suitable tools.
One last alternative would be to force llama.cpp to load all layers into VRAM and let the GPU overflow to RAM, rather than having ollama decide on the split. This maximizes VRAM usage at the cost of a performance penalty for the layers left in RAM. However, because you can almost fit the model in VRAM, only a few layers will spill over, and the penalty might not be noticeable. You can force this by setting `num_gpu` to the number of layers (or really any number greater than or equal to the layer count). See here for ways to adjust `num_gpu`; a minimal sketch follows below.
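For illustration, a minimal sketch of the `num_gpu` override, assuming the default local server and the 41-layer model from the logs above (the prompt and the exact layer count are placeholders, not prescriptive):
```
# One-off override via the REST API: request all 41 layers on the GPU.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-vision",
  "prompt": "why is the sky blue?",
  "options": { "num_gpu": 41 }
}'

# Or interactively, inside `ollama run llama3.2-vision`:
#   /set parameter num_gpu 41
```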
@rick-github commented on GitHub (Jan 7, 2025):
@ChandlerHooley
Your GPU has 3.2G free. Just the projector (2.8G) and context space (656M) add up to more than this, so there is no way to run llama3.2-vision on your GPU, even with the `num_gpu` hack from above.
@blueApple12 commented on GitHub (Jan 7, 2025):
Why is my GPU so full? I just built this PC a week ago. Will the full output of nvidia-smi help identify what takes all of the VRAM?
@rick-github commented on GitHub (Jan 7, 2025):
I'm not a Windows user so fine details of process usage escape me. Try this for help: https://saturncloud.io/blog/how-to-find-and-limit-gpu-usage-by-process-in-windows/#finding-gpu-usage-by-process
@blueApple12 commented on GitHub (Jan 7, 2025):
I really didn't understand this page. If someone can understand this and help me, it would be great.
Tue Jan 7 16:34:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36 Driver Version: 566.36 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... WDDM | 00000000:01:00.0 On | N/A |
| 30% 35C P5 15W / 220W | 895MiB / 12282MiB | 28% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1244 C+G ...2txyewy\StartMenuExperienceHost.exe N/A |
| 0 N/A N/A 2748 C+G ...air\Corsair iCUE5 Software\iCUE.exe N/A |
| 0 N/A N/A 2836 C+G ...nt.CBS_cw5n1h2txyewy\SearchHost.exe N/A |
| 0 N/A N/A 5704 C+G ...al\Discord\app-1.0.9175\Discord.exe N/A |
| 0 N/A N/A 6336 C+G ...\Cef\CefSharp.BrowserSubprocess.exe N/A |
| 0 N/A N/A 11832 C+G ....0_x64__8wekyb3d8bbwe\XboxPcApp.exe N/A |
| 0 N/A N/A 12004 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 14244 C+G ...6.0_x64__cv1g1gvanyjgm\WhatsApp.exe N/A |
| 0 N/A N/A 15152 C+G ...oogle\Chrome\Application\chrome.exe N/A |
| 0 N/A N/A 15476 C+G ...n\131.0.2903.112\msedgewebview2.exe N/A |
| 0 N/A N/A 21852 C+G ...n\NVIDIA app\CEF\NVIDIA Overlay.exe N/A |
| 0 N/A N/A 23444 C+G ...ces\Razer Central\Razer Central.exe N/A |
| 0 N/A N/A 23724 C+G ...n\131.0.2903.112\msedgewebview2.exe N/A |
| 0 N/A N/A 23812 C+G ...n\NVIDIA app\CEF\NVIDIA Overlay.exe N/A |
| 0 N/A N/A 24456 C+G ...siveControlPanel\SystemSettings.exe N/A |
| 0 N/A N/A 24876 C+G ...\cef\cef.win7x64\steamwebhelper.exe N/A |
| 0 N/A N/A 27828 C+G ...x64__97hta09mmv6hy\Build\Lively.exe N/A |
| 0 N/A N/A 30528 C+G ... Synapse 3 Host\Razer Synapse 3.exe N/A |
| 0 N/A N/A 32416 C+G ...nr4m\radeonsoftware\AMDRSSrcExt.exe N/A |
| 0 N/A N/A 34800 C+G ...m\radeonsoftware\RadeonSoftware.exe N/A |
| 0 N/A N/A 35960 C+G ...oogle\Chrome\Application\chrome.exe N/A |
| 0 N/A N/A 36928 C+G ...t.LockApp_cw5n1h2txyewy\LockApp.exe N/A |
| 0 N/A N/A 38648 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 41612 C+G ...5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 N/A N/A 41720 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 42312 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A |
| 0 N/A N/A 44232 C+G C:\Windows\System32\ShellHost.exe N/A |
| 0 N/A N/A 45884 C+G ...s\System32\ApplicationFrameHost.exe N/A |
| 0 N/A N/A 49344 C+G ...Programs\Microsoft VS Code\Code.exe N/A |
| 0 N/A N/A 51312 C+G ...__8wekyb3d8bbwe\WindowsTerminal.exe N/A |
+-----------------------------------------------------------------------------------------+
@rick-github commented on GitHub (Jan 7, 2025):
Unfortunately this is not a very useful output: it doesn't contain the VRAM usage, and the process names are incomplete, so it's not possible to identify the large users of VRAM. But there may be low-hanging fruit. Does your machine have an integrated graphics processor? If so, it may be possible to set it as the default GPU for the system in the BIOS, so that when Windows starts it doesn't allocate VRAM from the 4070. The alternative is to set the preferred GPU on a program-by-program basis, as discussed here.
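As an aside (not from the thread): the ollama runner is a compute (C) process, so while it is active its VRAM use can usually be listed with an `nvidia-smi` query like the sketch below. The C+G graphics processes in the table above generally need Task Manager's Details view with the "Dedicated GPU memory" column added instead.

```
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```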
@blueApple12 commented on GitHub (Jan 7, 2025):
I completely disabled the integrated graphics. Could that cause this? I did it because I thought ollama might use my integrated graphics instead of my GPU.
@rick-github commented on GitHub (Jan 7, 2025):
ollama will not use integrated graphics; there is very little support for those types of GPUs. Enable it and make it the default.
@kreier commented on GitHub (Jan 9, 2025):
Your RAM should be sufficient. This is really strange. I found conflicting statements about your available VRAM in your logfile https://github.com/ollama/ollama/issues/8310#issuecomment-2573402746 just a second apart:
I observed a similar behavior to your 4060 with two GTX 1060 6GB. Starting `llama3.2-vision` runs 100% on the CPU (check with `ollama ps` after leaving with `/bye`). Then I started a similar-sized model, `phi4`, and it went 100% to the GPU, split between both graphics cards, and used 11GB. I tried a few others, and the vision model was the outlier.

Can you try other models (like phi4) that should fit into your VRAM, to narrow down this behavior? You have more than 10 GB free, and `llama3.2-vision` usually needs a little more than 9 GB, even though the files are only 7.9 GB large. And even when it can't fit completely into the VRAM, it should split some layers off and process them in regular RAM. With my 8 GB card under Windows I get the following result after running ollama, exiting it and calling `ollama ps`:

I checked my logfile, and got a statement `layers.offload=7` where you got a zero. I don't know the reason yet:

@kreier commented on GitHub (Jan 9, 2025):
A similar behavior was noticed with 6GB VRAM graphics cards in November 2024: https://github.com/ollama/ollama/issues/7509 It works with my 8GB card and the problem described here is for more than 10 GB available VRAM.
@rick-github commented on GitHub (Jan 10, 2025):
It's sufficient if there are no other processes using the GPU. Switching to integrated graphics will help.
These are 67 seconds apart.
Vision models have extra requirements that make it harder to fit them in limited VRAM as discussed in https://github.com/ollama/ollama/issues/7509#issuecomment-2457887328.
@kreier commented on GitHub (Jan 11, 2025):
Thanks @rick-github for the feedback and double-checking my comment. Sorry for the mistake, I should learn how to read the time!
I tested this scenario again, and I'm not sure if llama3.2-vision will fit entirely into 12GB of VRAM. The use of the integrated graphics might be the only way, as pointed out by others above.
First I tried to run llama3.2-vision just on the CPU. To do this I set the parameter `/set parameter num_gpu 0` after starting ollama, and then gave it a prompt to process. I checked the RAM usage afterwards with `ollama ps` and got a result of 11 GB, which is less than 12GB, so a 3060 with 12GB might work. Surprisingly, when using the 8 GB GPU partially, the speed went down from 5.4 token/s to 4.5 token/s. The stated utilization from `ollama ps` was 43%/57% CPU/GPU, but I think this only relates to RAM, not token generation speed. The GPU seems to be used only for the projector (see below) and the token generation is done entirely by the CPU.

On another system with an 8GB card and a 6GB card I got llama3.2-vision almost entirely into the VRAM; just 4% was still processed by the CPU. It resulted in 15 token/s. Following the advice given in this thread I switched to the iGPU of my processor and gained a few megabytes on the larger card, and finally got 100% GPU utilization. The responsiveness increased by 55% to 23.3 token/s! That's the reward for having all layers in the fast GPU memory!
Here I checked the combined utilization of the GPUs with `nvtop`. The larger used 6.937 GiB and the smaller 4.583 GiB; combined this equals 11.52 GiB. There is not much space left if this should fit into a 12 GB card. `ollama ps` even reported 13 GB RAM used. The distribution of processing power was heavily skewed: the big 8GB card used only 12% of the GPU power, while the smaller 6GB card got up to 84%.

One thing I still don't understand is how the memory requirements for the projector combine to something very close to 8GB, so any system with graphics cards smaller than 8GB might not even split the model to use the combined VRAM. It was already stated that the vision model is unique in this regard and needs one continuous chunk of RAM to operate. The logfile states:
I can't see how 1.8 + 2.8 adds up to something like 6.837 GiB, even if I add the 656 MiB for KV. Can someone explain the math to me? When using the system with only one 8 GB card, the logfile (see above) states that only 7 of the 41 layers were offloaded to the GPU:
This seems to be the "minimum pieces of the model that have to be loaded in VRAM in their entirety for anything to run on the GPU" that @jessegross mentioned in issue 7509 on November 6, 2024. https://github.com/ollama/ollama/issues/7509#issuecomment-2457887328
@blueApple12 commented on GitHub (Jan 15, 2025):
llama3.2-vision:latest 085a1fdae525 11 GB 100% CPU 4 minutes from now
I ran `/set parameter num_gpu 0`. It doesn't work. Why?
@rick-github commented on GitHub (Jan 15, 2025):
`num_gpu:0` means load 0 layers into the GPU.
@kreier commented on GitHub (Jan 16, 2025):
It actually works as intended. It sets the number of layers offloaded to the GPU to zero and runs entirely on the CPU, and `ollama ps` just confirmed that all is done on the CPU. I ran this to see the minimum required contiguous memory to run llama3.2-vision. The result of 11GB is lower than the VRAM of your GPU with 12 GB, so it might fit. If some layers are split to run on the GPU and some on the CPU, the total memory demand increases, here usually to 13 GB.

But with your 12 GB card at least some layers would be processed by the GPU if at least 8 GB are available. If you don't run a game in the background this should be possible. Can you try again to close all applications and just run ollama, to see if at least a few (maybe the first 7) layers will be offloaded to the GPU (needing 8 GB)? Or if you connect your monitor to the iGPU? Then it could be possible to run the complete llama3.2-vision on the GPU, at least according to my calculations.
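For reference, a minimal sketch of this CPU-only test (the `ollama ps` line mirrors the one quoted above; the prompt is just a placeholder):

```
$ ollama run llama3.2-vision
>>> /set parameter num_gpu 0
>>> Why is the sky blue?
...
>>> /bye

$ ollama ps
NAME                     ID              SIZE     PROCESSOR    UNTIL
llama3.2-vision:latest   085a1fdae525    11 GB    100% CPU     4 minutes from now
```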
@blueApple12 commented on GitHub (Jan 16, 2025):
I don't want this to be entirely on the GPU, but when I run it normally it still doesn't utilize my GPU.
@rick-github commented on GitHub (Jan 16, 2025):
Have you switched to iGPU? Can you supply server logs?
@kreier commented on GitHub (Jan 16, 2025):
Can you be a little more specific? My card has only 8GB and I'm using Windows, too. First I check the free VRAM with `nvidia-smi`: only 950 MiB are used by the system for Chrome, YouTube videos, etc.

Then I start a regular `ollama run llama3.2-vision` and have a conversation. Leaving with `/bye` and checking utilization and VRAM again: now some 6920 MiB of the GPU VRAM are used, the 52% listed by ollama after `ollama ps`. What are your values?
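As an aside, a compact way to watch just the memory numbers during such a test (a sketch; the `--query-gpu` properties and the `-l` repeat interval are standard `nvidia-smi` flags):

```
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv -l 5
```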
ollama ps. What are your values?@blueApple12 commented on GitHub (Jan 17, 2025):
What is the recommended `num_gpu` for this?
@rick-github commented on GitHub (Jan 17, 2025):
ollama will compute `num_gpu` and show it in the log; search for `layers.offload`. You can override this in the API call or Modelfile if you think ollama is wrong (see the sketch below). The maximum value is `layers.model`.
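A hedged illustration of the Modelfile route (standard `FROM`/`PARAMETER` syntax; the 41-layer count comes from this thread, and the model name `llama3.2-vision-gpu` is just an example):

```
# Modelfile: request all 41 layers on the GPU
# (any value >= the model's layer count has the same effect, per the thread)
FROM llama3.2-vision
PARAMETER num_gpu 41
```

Build and run it with `ollama create llama3.2-vision-gpu -f Modelfile` followed by `ollama run llama3.2-vision-gpu`.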
@kreier commented on GitHub (Jan 17, 2025):
Going over the logfile you posted https://github.com/ollama/ollama/issues/8310#issuecomment-2573402746 it looks like your GPU was utilized running `llama3.2-vision` and `llama3.2` 3B instruct. Here are a few timestamps and excerpts:

It looks like you were going back and forth between the larger `Llama-3.2-11B-Vision-Instruct` model and the smaller `Llama 3.2 3B Instruct` model. And as the logfile shows, all 29 layers of the smaller model were offloaded into the GPU. If you had checked with `ollama ps` you would have gotten 100% GPU while using 2.9 GiB.

As for the vision model, depending on the available VRAM it was partially loaded into your GPU in some instances:

The last one was close! Freeing your VRAM might have fit all 41 layers. Or use of the iGPU. Interestingly, if the model is split between GPU and CPU, the split parameter states `layers.split=""`; only when splitting between several GPUs do you get the distribution.

So the issue you posted here seems to have applied only at 2025-01-05T16:44:53.336+02:00 and 2025-01-05T17:32:29.130+02:00, when your GPU was running out of VRAM. But ollama was using your GPU before and after that.