[GH-ISSUE #10740] 0.6.8 already has weird memory usage (also 0.7.0) #7054

Closed
opened 2026-04-12 18:58:19 -05:00 by GiteaMirror · 9 comments
Owner

Originally created by @Fade78 on GitHub (May 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10740

What is the issue?

Here, the regular (Q4_K_M) Qwen3:32b is loaded with a 19000-token context length. I tuned this context length so the model fits exactly across my three dedicated 4060 Tis (16 GB each). So I know that if I increase the context even a little, the requirement exceeds 48 GB and the model gets split between GPU and CPU.

There is only one problem: the VRAM actually used is not 48 GB. It is about 70% of 48 GB on 0.6.8, and I saw another case at about 50% on 0.7.0. This means ollama will offload to the CPU even though VRAM is still available, dramatically slowing down inference.

![Image](https://github.com/user-attachments/assets/100d4e74-cc73-4db0-a988-bc7751593420)

So here you can see that ollama reports 47 GB in use; with older versions, that would have meant almost 100% actual VRAM usage on my 3x16 GB rig.

Again, if I increase the context, the model will split between GPU and CPU; that is how I fine-tune the context length.
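For reference, the fit decision can be sketched roughly from the numbers that the `msg=offload` log line reports for this run (a minimal sketch; the real logic in ollama's `sched.go` is more involved, this only shows why the scheduler declares the model fits):

```python
# Per-GPU fit check using the values from the "msg=offload" log line.
# These are the logged estimates, not measured usage.

available_per_gpu_gib = [15.5, 15.5, 15.5]   # memory.available
required_per_gpu_gib  = [14.6, 15.3, 14.2]   # memory.required.allocations

# The model "fits" if every per-GPU allocation is within that GPU's free VRAM.
fits = all(req <= avail
           for req, avail in zip(required_per_gpu_gib, available_per_gpu_gib))

# Sum of allocations: ~44 GiB, matching memory.required.full in the log.
total_required_gib = sum(required_per_gpu_gib)

print(fits, round(total_required_gib, 1))
```

The estimate says 44 GiB of 46.5 GiB available is needed, so the issue is about the gap between this estimate and the VRAM actually consumed at runtime.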

Here are some logs:

[GIN] 2025/05/16 - 15:44:06 | 500 | 49.226007283s |   192.168.10.11 | POST     "/api/chat"
2025/05/16 15:44:07 routes.go:1233: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:36h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-05-16T15:44:07.744Z level=INFO source=images.go:463 msg="total blobs: 335"
time=2025-05-16T15:44:07.746Z level=INFO source=images.go:470 msg="total unused blobs removed: 0"
time=2025-05-16T15:44:07.747Z level=INFO source=routes.go:1300 msg="Listening on [::]:11434 (version 0.6.8)"
time=2025-05-16T15:44:07.747Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-16T15:44:08.171Z level=INFO source=types.go:130 msg="inference compute" id=GPU-abd5d393-e831-c64e-d4eb-1c8797fc1fbf library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.5 GiB"
time=2025-05-16T15:44:08.172Z level=INFO source=types.go:130 msg="inference compute" id=GPU-9fea6450-0337-d707-5f2f-003347856eed library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.5 GiB"
time=2025-05-16T15:44:08.172Z level=INFO source=types.go:130 msg="inference compute" id=GPU-a5324074-2159-7d4d-b32e-31afd9176c45 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.5 GiB"
time=2025-05-16T15:44:08.432Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-16T15:44:08.738Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-16T15:44:08.744Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-05-16T15:44:08.744Z level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-16T15:44:08.745Z level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-16T15:44:08.745Z level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-16T15:44:08.746Z level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-16T15:44:08.746Z level=INFO source=sched.go:770 msg="new model will fit in available VRAM, loading" model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 library=cuda parallel=1 required="44.0 GiB"
time=2025-05-16T15:44:09.043Z level=INFO source=server.go:106 msg="system memory" total="62.8 GiB" free="60.6 GiB" free_swap="0 B"
time=2025-05-16T15:44:09.043Z level=WARN source=ggml.go:152 msg="key not found" key=qwen3.vision.block_count default=0
time=2025-05-16T15:44:09.043Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split=22,22,21 memory.available="[15.5 GiB 15.5 GiB 15.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="44.0 GiB" memory.required.partial="44.0 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[14.6 GiB 15.3 GiB 14.2 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="6.2 GiB" memory.graph.partial="6.2 GiB"
llama_model_loader: loaded meta data with 27 key-value pairs and 707 tensors from /root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 25600
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 64
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type  f16:   64 tensors
llama_model_loader: - type q4_K:  353 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.81 GiB (4.93 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-05-16T15:44:09.143Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 --ctx-size 19000 --batch-size 512 --n-gpu-layers 65 --threads 16 --parallel 1 --tensor-split 22,22,21 --port 36533"
time=2025-05-16T15:44:09.143Z level=INFO source=sched.go:452 msg="loaded runners" count=1
time=2025-05-16T15:44:09.143Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
time=2025-05-16T15:44:09.143Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-16T15:44:09.150Z level=INFO source=runner.go:853 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-05-16T15:44:09.331Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-05-16T15:44:09.332Z level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:36533"
time=2025-05-16T15:44:09.395Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 15831 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4060 Ti) - 15831 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 4060 Ti) - 15831 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 707 tensors from /root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 25600
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 64
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type  f16:   64 tensors
llama_model_loader: - type q4_K:  353 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.81 GiB (4.93 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 25600
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[GIN] 2025/05/16 - 15:44:10 | 200 |       28.53µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/16 - 15:44:10 | 200 |      102.47µs |       127.0.0.1 | GET      "/api/ps"
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:        CUDA0 model buffer size =  6300.10 MiB
load_tensors:        CUDA1 model buffer size =  6171.19 MiB
load_tensors:        CUDA2 model buffer size =  6371.11 MiB
load_tensors:   CPU_Mapped model buffer size =   417.30 MiB
[GIN] 2025/05/16 - 15:44:12 | 200 |      24.929µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/16 - 15:44:12 | 200 |      34.071µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/05/16 - 15:44:14 | 200 |       25.22µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/16 - 15:44:14 | 200 |       25.99µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/05/16 - 15:44:16 | 200 |      22.209µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/16 - 15:44:16 | 200 |       23.17µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/05/16 - 15:44:18 | 200 |       26.36µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/16 - 15:44:18 | 200 |      27.531µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/05/16 - 15:44:20 | 200 |       26.81µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/16 - 15:44:20 | 200 |       39.83µs |       127.0.0.1 | GET      "/api/ps"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 19000
llama_context: n_ctx_per_seq = 19000
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (19000) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.60 MiB
init: kv_size = 19008, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
init:      CUDA0 KV buffer size =  1633.50 MiB
init:      CUDA1 KV buffer size =  1633.50 MiB
init:      CUDA2 KV buffer size =  1485.00 MiB
llama_context: KV self size  = 4752.00 MiB, K (f16): 2376.00 MiB, V (f16): 2376.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =  2616.51 MiB
llama_context:      CUDA1 compute buffer size =  2616.51 MiB
llama_context:      CUDA2 compute buffer size =  2616.52 MiB
llama_context:  CUDA_Host compute buffer size =   158.52 MiB
llama_context: graph nodes  = 2438
llama_context: graph splits = 4
time=2025-05-16T15:44:21.182Z level=INFO source=server.go:628 msg="llama runner started in 12.04 seconds"
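As a sanity check, the KV cache size in the logs above can be reproduced from the parameters they report (a back-of-the-envelope sketch, assuming the standard layout where K and V each store `kv_size * n_embd_kv` f16 elements per layer):

```python
# Reproduce "KV self size = 4752.00 MiB" from the logged parameters.

kv_size   = 19008   # init: kv_size = 19008 (context padded from 19000)
n_layer   = 64      # print_info: n_layer = 64
n_embd_kv = 1024    # print_info: n_embd_k_gqa = n_embd_v_gqa = 1024
bytes_f16 = 2       # type_k = type_v = 'f16'

k_bytes = kv_size * n_embd_kv * bytes_f16 * n_layer
v_bytes = k_bytes   # V has the same shape as K here

total_mib = (k_bytes + v_bytes) / (1024 ** 2)
print(total_mib)    # 4752.0, matching K (2376 MiB) + V (2376 MiB) in the log
```

So the 4.6 GiB KV estimate is exact; the discrepancy the report describes is elsewhere (actual weight/compute-buffer residency vs. the scheduler's estimate).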

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

print_info: model type = 32B print_info: model params = 32.76 B print_info: general.name = Qwen3 32B print_info: vocab type = BPE print_info: n_vocab = 151936 print_info: n_merges = 151387 print_info: BOS token = 151643 '<|endoftext|>' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151643 '<|endoftext|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... 
(mmap = true) [GIN] 2025/05/16 - 15:44:10 | 200 | 28.53µs | 127.0.0.1 | HEAD "/" [GIN] 2025/05/16 - 15:44:10 | 200 | 102.47µs | 127.0.0.1 | GET "/api/ps" load_tensors: offloading 64 repeating layers to GPU load_tensors: offloading output layer to GPU load_tensors: offloaded 65/65 layers to GPU load_tensors: CUDA0 model buffer size = 6300.10 MiB load_tensors: CUDA1 model buffer size = 6171.19 MiB load_tensors: CUDA2 model buffer size = 6371.11 MiB load_tensors: CPU_Mapped model buffer size = 417.30 MiB [GIN] 2025/05/16 - 15:44:12 | 200 | 24.929µs | 127.0.0.1 | HEAD "/" [GIN] 2025/05/16 - 15:44:12 | 200 | 34.071µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/05/16 - 15:44:14 | 200 | 25.22µs | 127.0.0.1 | HEAD "/" [GIN] 2025/05/16 - 15:44:14 | 200 | 25.99µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/05/16 - 15:44:16 | 200 | 22.209µs | 127.0.0.1 | HEAD "/" [GIN] 2025/05/16 - 15:44:16 | 200 | 23.17µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/05/16 - 15:44:18 | 200 | 26.36µs | 127.0.0.1 | HEAD "/" [GIN] 2025/05/16 - 15:44:18 | 200 | 27.531µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/05/16 - 15:44:20 | 200 | 26.81µs | 127.0.0.1 | HEAD "/" [GIN] 2025/05/16 - 15:44:20 | 200 | 39.83µs | 127.0.0.1 | GET "/api/ps" llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 19000 llama_context: n_ctx_per_seq = 19000 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = 0 llama_context: freq_base = 1000000.0 llama_context: freq_scale = 1 llama_context: n_ctx_per_seq (19000) < n_ctx_train (40960) -- the full capacity of the model will not be utilized llama_context: CUDA_Host output buffer size = 0.60 MiB init: kv_size = 19008, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1 init: CUDA0 KV buffer size = 1633.50 MiB init: CUDA1 KV buffer size = 1633.50 MiB init: CUDA2 KV buffer size = 1485.00 MiB llama_context: KV self size = 4752.00 MiB, K (f16): 2376.00 MiB, V (f16): 2376.00 
MiB llama_context: pipeline parallelism enabled (n_copies=4) llama_context: CUDA0 compute buffer size = 2616.51 MiB llama_context: CUDA1 compute buffer size = 2616.51 MiB llama_context: CUDA2 compute buffer size = 2616.52 MiB llama_context: CUDA_Host compute buffer size = 158.52 MiB llama_context: graph nodes = 2438 llama_context: graph splits = 4 time=2025-05-16T15:44:21.182Z level=INFO source=server.go:628 msg="llama runner started in 12.04 seconds" ``` ### Relevant log output ```shell ``` ### OS _No response_ ### GPU _No response_ ### CPU _No response_ ### Ollama version _No response_
GiteaMirror added the bug label 2026-04-12 18:58:19 -05:00

@johnnysn commented on GitHub (May 17, 2025):

It looks like I'm facing the same problem here, though in my case it only seems to affect Qwen3. My server has 3 RTX 3090s, and when I deploy Qwen3-32B-Q6_K with a 32k context size, the logs report a total required memory of 67.3 GiB even though the actual VRAM usage is about 36 GiB. If I increase the context size even slightly, Ollama starts offloading some layers to the CPU, which significantly hurts performance.

UPDATE: I just upgraded to version 0.7.0 and the behavior persists.
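As a cross-check on the reports above, the "KV self size" / `memory.required.kv` numbers in both logs follow directly from the model's GQA geometry (`n_layer = 64`, `n_embd_k_gqa = n_embd_v_gqa = 1024`, f16 cache, and a kv_size padded from 19000 to 19008 in the first log). This is only a rough sketch of the KV-cache term, not Ollama's actual estimator; the compute-buffer and graph allocations it adds on top are not modeled here:

```python
def kv_cache_mib(n_ctx, n_layer=64, n_embd_gqa=1024, bytes_per_elem=2):
    """f16 KV-cache size in MiB: one K and one V tensor of
    n_ctx * n_embd_gqa elements per layer (hypothetical helper,
    values taken from the print_info lines in the logs)."""
    return 2 * n_layer * n_ctx * n_embd_gqa * bytes_per_elem / 2**20

print(kv_cache_mib(19008))  # 4752.0 -> "KV self size = 4752.00 MiB"
print(kv_cache_mib(32768))  # 8192.0 -> memory.required.kv = "8.0 GiB"
```

Both values match the logs exactly, which suggests the over-reservation described in this issue comes from the other terms in the estimate (weights split and the ~10.7 GiB `memory.graph.full`), not from the KV cache itself.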

The logs:

gpu=GPU-87562afb-5b3f-53ed-67d0-0ca22728310e name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.1 GiB" now.total="23.6 GiB" now.free="23.1 GiB" now.used="459.5 MiB"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.453-03:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-71bb38d6-a9e1-a8fb-81a1-be53f9c998dd name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="276.9 MiB"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-db8680a0-d1b7-7702-914c-fab5322aaa6a name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="276.9 MiB"
mai 17 11:51:48 urano ollama[1900]: releasing cuda driver library
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split=22,22,21 memory.available="[23.3 GiB 23.3 GiB 23.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="67.3 GiB" memory.required.partial="67.3 GiB" memory.required.kv="8.0 GiB" memory.required.allocations="[22.6 GiB 22.7 GiB 22.1 GiB]" memory.weights.total="24.4 GiB" memory.weights.repeating="23.8 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="10.7 GiB" memory.graph.partial="10.7 GiB"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=INFO source=server.go:186 msg="enabling flash attention"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=WARN source=server.go:194 msg="kv cache type not supported by model" type=""
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=DEBUG source=server.go:263 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: loaded meta data with 28 key-value pairs and 707 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-c4c7c3cb6da260df1fe1d3cfd090a32dc7cc348f1278158be18e301f390d6f6e (version GGUF V3 (latest))
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   0:                       general.architecture str              = qwen3
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   1:                               general.type str              = model
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B Instruct
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   4:                           general.basename str              = Qwen3
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   5:                         general.size_label str              = 32B
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   6:                          qwen3.block_count u32              = 64
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   7:                       qwen3.context_length u32              = 40960
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   8:                     qwen3.embedding_length u32              = 5120
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv   9:                  qwen3.feed_forward_length u32              = 25600
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  10:                 qwen3.attention.head_count u32              = 64
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  11:              qwen3.attention.head_count_kv u32              = 8
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  12:                       qwen3.rope.freq_base f32              = 1000000.000000
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  13:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  14:                 qwen3.attention.key_length u32              = 128
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  15:               qwen3.attention.value_length u32              = 128
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = gpt2
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = qwen2
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  20:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 151645
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 151643
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 151643
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  26:               general.quantization_version u32              = 2
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv  27:                          general.file_type u32              = 18
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - type  f32:  257 tensors
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - type q6_K:  450 tensors
mai 17 11:51:48 urano ollama[1900]: print_info: file format = GGUF V3 (latest)
mai 17 11:51:48 urano ollama[1900]: print_info: file type   = Q6_K
mai 17 11:51:48 urano ollama[1900]: print_info: file size   = 25.03 GiB (6.56 BPW)
mai 17 11:51:48 urano ollama[1900]: init_tokenizer: initializing tokenizer for type 2
mai 17 11:51:48 urano ollama[1900]: load: control token: 151660 '<|fim_middle|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151653 '<|vision_end|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151648 '<|box_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151649 '<|box_end|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151655 '<|image_pad|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151651 '<|quad_end|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151652 '<|vision_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151654 '<|vision_pad|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151656 '<|video_pad|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151644 '<|im_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151650 '<|quad_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: special tokens cache size = 26
mai 17 11:51:48 urano ollama[1900]: load: token to piece cache size = 0.9311 MB
mai 17 11:51:48 urano ollama[1900]: print_info: arch             = qwen3
mai 17 11:51:48 urano ollama[1900]: print_info: vocab_only       = 1
mai 17 11:51:48 urano ollama[1900]: print_info: model type       = ?B
mai 17 11:51:48 urano ollama[1900]: print_info: model params     = 32.76 B
mai 17 11:51:48 urano ollama[1900]: print_info: general.name     = Qwen3 32B Instruct
mai 17 11:51:48 urano ollama[1900]: print_info: vocab type       = BPE
mai 17 11:51:48 urano ollama[1900]: print_info: n_vocab          = 151936
mai 17 11:51:48 urano ollama[1900]: print_info: n_merges         = 151387
mai 17 11:51:48 urano ollama[1900]: print_info: BOS token        = 151643 '<|endoftext|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOS token        = 151645 '<|im_end|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOT token        = 151645 '<|im_end|>'
mai 17 11:51:48 urano ollama[1900]: print_info: PAD token        = 151643 '<|endoftext|>'
mai 17 11:51:48 urano ollama[1900]: print_info: LF token         = 198 'Ċ'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM MID token    = 151660 '<|fim_middle|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM PAD token    = 151662 '<|fim_pad|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM REP token    = 151663 '<|repo_name|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM SEP token    = 151664 '<|file_sep|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token        = 151643 '<|endoftext|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token        = 151645 '<|im_end|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token        = 151662 '<|fim_pad|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token        = 151663 '<|repo_name|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token        = 151664 '<|file_sep|>'
mai 17 11:51:48 urano ollama[1900]: print_info: max token length = 256
mai 17 11:51:48 urano ollama[1900]: llama_model_load: vocab only - skipping tensors
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.813-03:00 level=DEBUG source=server.go:339 msg="adding gpu library" path=/usr/local/lib/ollama/cuda_v12
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.813-03:00 level=DEBUG source=server.go:346 msg="adding gpu dependency paths" paths=[/usr/local/lib/ollama/cuda_v12]
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.813-03:00 level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-c4c7c3cb6da260df1fe1d3cfd090a32dc7cc348f1278158be18e301f390d6f6e --ctx-size 32768 --batch-size 512 --n-gpu-layers 65 --verbose --threads 32 --flash-attn --parallel 1 --tensor-split 22,22,21 --port 43719"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.813-03:00 level=DEBUG source=server.go:429 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin OLLAMA_HOST=0.0.0.0 OLLAMA_KEEP_ALIVE=10m OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=true OLLAMA_MAX_LOADED_MODELS=9 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12 LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama:/usr/local/lib/ollama CUDA_VISIBLE_DEVICES=GPU-71bb38d6-a9e1-a8fb-81a1-be53f9c998dd,GPU-db8680a0-d1b7-7702-914c-fab5322aaa6a,GPU-87562afb-5b3f-53ed-67d0-0ca22728310e]"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.814-03:00 level=INFO source=sched.go:452 msg="loaded runners" count=1
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.814-03:00 level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.814-03:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.826-03:00 level=INFO source=runner.go:853 msg="starting go runner"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.826-03:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=/usr/local/lib/ollama
mai 17 11:51:48 urano ollama[1900]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.830-03:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=/usr/local/lib/ollama/cuda_v12
mai 17 11:51:48 urano ollama[1900]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
mai 17 11:51:48 urano ollama[1900]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
mai 17 11:51:48 urano ollama[1900]: ggml_cuda_init: found 3 CUDA devices:
mai 17 11:51:48 urano ollama[1900]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
mai 17 11:51:48 urano ollama[1900]:   Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
mai 17 11:51:48 urano ollama[1900]:   Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
mai 17 11:51:49 urano ollama[1900]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
mai 17 11:51:49 urano ollama[1900]: time=2025-05-17T11:51:49.150-03:00 level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
mai 17 11:51:49 urano ollama[1900]: time=2025-05-17T11:51:49.151-03:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:43719"
mai 17 11:51:49 urano ollama[1900]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23873 MiB free
mai 17 11:51:49 urano ollama[1900]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23873 MiB free
mai 17 11:51:49 urano ollama[1900]: llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 3090) - 23657 MiB free
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: loaded meta data with 28 key-value pairs and 707 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-c4c7c3cb6da260df1fe1d3cfd090a32dc7cc348f1278158be18e301f390d6f6e (version GGUF V3 (latest))
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   0:                       general.architecture str              = qwen3
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   1:                               general.type str              = model
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B Instruct
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   4:                           general.basename str              = Qwen3
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   5:                         general.size_label str              = 32B
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   6:                          qwen3.block_count u32              = 64
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   7:                       qwen3.context_length u32              = 40960
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   8:                     qwen3.embedding_length u32              = 5120
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv   9:                  qwen3.feed_forward_length u32              = 25600
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  10:                 qwen3.attention.head_count u32              = 64
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  11:              qwen3.attention.head_count_kv u32              = 8
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  12:                       qwen3.rope.freq_base f32              = 1000000.000000
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  13:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  14:                 qwen3.attention.key_length u32              = 128
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  15:               qwen3.attention.value_length u32              = 128
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = gpt2
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = qwen2
mai 17 11:51:49 urano ollama[1900]: time=2025-05-17T11:51:49.316-03:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  20:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 151645
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 151643
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 151643
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  26:               general.quantization_version u32              = 2
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv  27:                          general.file_type u32              = 18
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - type  f32:  257 tensors
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - type q6_K:  450 tensors
mai 17 11:51:49 urano ollama[1900]: print_info: file format = GGUF V3 (latest)
mai 17 11:51:49 urano ollama[1900]: print_info: file type   = Q6_K
mai 17 11:51:49 urano ollama[1900]: print_info: file size   = 25.03 GiB (6.56 BPW)
mai 17 11:51:49 urano ollama[1900]: init_tokenizer: initializing tokenizer for type 2
mai 17 11:51:49 urano ollama[1900]: load: control token: 151660 '<|fim_middle|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151653 '<|vision_end|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151648 '<|box_start|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151649 '<|box_end|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151655 '<|image_pad|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151651 '<|quad_end|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151652 '<|vision_start|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151654 '<|vision_pad|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151656 '<|video_pad|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151644 '<|im_start|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: control token: 151650 '<|quad_start|>' is not marked as EOG
mai 17 11:51:49 urano ollama[1900]: load: special tokens cache size = 26
mai 17 11:51:49 urano ollama[1900]: load: token to piece cache size = 0.9311 MB
mai 17 11:51:49 urano ollama[1900]: print_info: arch             = qwen3
mai 17 11:51:49 urano ollama[1900]: print_info: vocab_only       = 0
mai 17 11:51:49 urano ollama[1900]: print_info: n_ctx_train      = 40960
mai 17 11:51:49 urano ollama[1900]: print_info: n_embd           = 5120
mai 17 11:51:49 urano ollama[1900]: print_info: n_layer          = 64
mai 17 11:51:49 urano ollama[1900]: print_info: n_head           = 64
mai 17 11:51:49 urano ollama[1900]: print_info: n_head_kv        = 8
mai 17 11:51:49 urano ollama[1900]: print_info: n_rot            = 128
mai 17 11:51:49 urano ollama[1900]: print_info: n_swa            = 0
mai 17 11:51:49 urano ollama[1900]: print_info: n_swa_pattern    = 1
mai 17 11:51:49 urano ollama[1900]: print_info: n_embd_head_k    = 128
mai 17 11:51:49 urano ollama[1900]: print_info: n_embd_head_v    = 128
mai 17 11:51:49 urano ollama[1900]: print_info: n_gqa            = 8
mai 17 11:51:49 urano ollama[1900]: print_info: n_embd_k_gqa     = 1024
mai 17 11:51:49 urano ollama[1900]: print_info: n_embd_v_gqa     = 1024
mai 17 11:51:49 urano ollama[1900]: print_info: f_norm_eps       = 0.0e+00
mai 17 11:51:49 urano ollama[1900]: print_info: f_norm_rms_eps   = 1.0e-06
mai 17 11:51:49 urano ollama[1900]: print_info: f_clamp_kqv      = 0.0e+00
mai 17 11:51:49 urano ollama[1900]: print_info: f_max_alibi_bias = 0.0e+00
mai 17 11:51:49 urano ollama[1900]: print_info: f_logit_scale    = 0.0e+00
mai 17 11:51:49 urano ollama[1900]: print_info: f_attn_scale     = 0.0e+00
mai 17 11:51:49 urano ollama[1900]: print_info: n_ff             = 25600
mai 17 11:51:49 urano ollama[1900]: print_info: n_expert         = 0
mai 17 11:51:49 urano ollama[1900]: print_info: n_expert_used    = 0
mai 17 11:51:49 urano ollama[1900]: print_info: causal attn      = 1
mai 17 11:51:49 urano ollama[1900]: print_info: pooling type     = 0
mai 17 11:51:49 urano ollama[1900]: print_info: rope type        = 2
mai 17 11:51:49 urano ollama[1900]: print_info: rope scaling     = linear
mai 17 11:51:49 urano ollama[1900]: print_info: freq_base_train  = 1000000.0
mai 17 11:51:49 urano ollama[1900]: print_info: freq_scale_train = 1
mai 17 11:51:49 urano ollama[1900]: print_info: n_ctx_orig_yarn  = 40960
mai 17 11:51:49 urano ollama[1900]: print_info: rope_finetuned   = unknown
mai 17 11:51:49 urano ollama[1900]: print_info: ssm_d_conv       = 0
mai 17 11:51:49 urano ollama[1900]: print_info: ssm_d_inner      = 0
mai 17 11:51:49 urano ollama[1900]: print_info: ssm_d_state      = 0
mai 17 11:51:49 urano ollama[1900]: print_info: ssm_dt_rank      = 0
mai 17 11:51:49 urano ollama[1900]: print_info: ssm_dt_b_c_rms   = 0
mai 17 11:51:49 urano ollama[1900]: print_info: model type       = 32B
mai 17 11:51:49 urano ollama[1900]: print_info: model params     = 32.76 B
mai 17 11:51:49 urano ollama[1900]: print_info: general.name     = Qwen3 32B Instruct
mai 17 11:51:49 urano ollama[1900]: print_info: vocab type       = BPE
mai 17 11:51:49 urano ollama[1900]: print_info: n_vocab          = 151936
mai 17 11:51:49 urano ollama[1900]: print_info: n_merges         = 151387
mai 17 11:51:49 urano ollama[1900]: print_info: BOS token        = 151643 '<|endoftext|>'
mai 17 11:51:49 urano ollama[1900]: print_info: EOS token        = 151645 '<|im_end|>'
mai 17 11:51:49 urano ollama[1900]: print_info: EOT token        = 151645 '<|im_end|>'
mai 17 11:51:49 urano ollama[1900]: print_info: PAD token        = 151643 '<|endoftext|>'
mai 17 11:51:49 urano ollama[1900]: print_info: LF token         = 198 'Ċ'
mai 17 11:51:49 urano ollama[1900]: print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
mai 17 11:51:49 urano ollama[1900]: print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
mai 17 11:51:49 urano ollama[1900]: print_info: FIM MID token    = 151660 '<|fim_middle|>'
mai 17 11:51:49 urano ollama[1900]: print_info: FIM PAD token    = 151662 '<|fim_pad|>'
mai 17 11:51:49 urano ollama[1900]: print_info: FIM REP token    = 151663 '<|repo_name|>'
mai 17 11:51:49 urano ollama[1900]: print_info: FIM SEP token    = 151664 '<|file_sep|>'
mai 17 11:51:49 urano ollama[1900]: print_info: EOG token        = 151643 '<|endoftext|>'
mai 17 11:51:49 urano ollama[1900]: print_info: EOG token        = 151645 '<|im_end|>'
mai 17 11:51:49 urano ollama[1900]: print_info: EOG token        = 151662 '<|fim_pad|>'
mai 17 11:51:49 urano ollama[1900]: print_info: EOG token        = 151663 '<|repo_name|>'
mai 17 11:51:49 urano ollama[1900]: print_info: EOG token        = 151664 '<|file_sep|>'
mai 17 11:51:49 urano ollama[1900]: print_info: max token length = 256
mai 17 11:51:49 urano ollama[1900]: load_tensors: loading model tensors, this can take a while... (mmap = true)
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  22 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  23 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  24 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  25 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  26 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  27 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  28 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  29 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  30 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  31 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  32 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  33 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  34 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  35 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  36 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  37 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  38 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  39 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  40 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  41 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  42 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  43 assigned to device CUDA1, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  44 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  45 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  46 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  47 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  48 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  49 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  50 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  51 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  52 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  53 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  54 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  55 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  56 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  57 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  58 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  59 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  60 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  61 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  62 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  63 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: layer  64 assigned to device CUDA2, is_swa = 0
mai 17 11:51:49 urano ollama[1900]: load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
mai 17 11:51:51 urano ollama[1900]: load_tensors: offloading 64 repeating layers to GPU
mai 17 11:51:51 urano ollama[1900]: load_tensors: offloading output layer to GPU
mai 17 11:51:51 urano ollama[1900]: load_tensors: offloaded 65/65 layers to GPU
mai 17 11:51:51 urano ollama[1900]: load_tensors:        CUDA0 model buffer size =  8392.68 MiB
mai 17 11:51:51 urano ollama[1900]: load_tensors:        CUDA1 model buffer size =  8392.68 MiB
mai 17 11:51:51 urano ollama[1900]: load_tensors:        CUDA2 model buffer size =  8238.30 MiB
mai 17 11:51:51 urano ollama[1900]: load_tensors:   CPU_Mapped model buffer size =   608.57 MiB
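
For reference (an aside, not part of the log): the four model buffers above can be cross-checked against the quantized file size reported elsewhere in these logs (25.03 GiB at 6.56 BPW for 32.76 B parameters). A quick sketch of that arithmetic, with all figures copied from the logs:

```python
# Cross-check the load_tensors buffer sizes against the GGUF file size.
# MiB/GiB are binary units (2**20 / 2**30 bytes).
cuda_buffers_mib = [8392.68, 8392.68, 8238.30]  # CUDA0/1/2 model buffer size
cpu_mapped_mib = 608.57                         # CPU_Mapped model buffer size

total_gib = (sum(cuda_buffers_mib) + cpu_mapped_mib) / 1024
print(round(total_gib, 2))  # 25.03 -> matches "file size = 25.03 GiB"

# The same figure follows from the parameter count and bits-per-weight:
params, bpw = 32.76e9, 6.56
print(round(params * bpw / 8 / 2**30, 2))  # ~25.02
```

So the weights themselves account for roughly 25 GiB across the three GPUs, far below the scheduler's `memory.required.*` figures seen later in these logs.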
mai 17 11:51:52 urano ollama[1900]: time=2025-05-17T11:51:52.076-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.11"
mai 17 11:51:52 urano ollama[1900]: time=2025-05-17T11:51:52.327-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.28"
mai 17 11:51:52 urano ollama[1900]: time=2025-05-17T11:51:52.578-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.45"
mai 17 11:51:52 urano ollama[1900]: time=2025-05-17T11:51:52.828-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.63"
mai 17 11:51:53 urano ollama[1900]: time=2025-05-17T11:51:53.079-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.80"
mai 17 11:51:53 urano ollama[1900]: time=2025-05-17T11:51:53.330-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.97"
mai 17 11:51:53 urano ollama[1900]: time=2025-05-17T11:51:53.581-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.98"
mai 17 11:51:54 urano ollama[1900]: llama_context: constructing llama_context
mai 17 11:51:54 urano ollama[1900]: llama_context: n_seq_max     = 1
mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx         = 32768
mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx_per_seq = 32768
mai 17 11:51:54 urano ollama[1900]: llama_context: n_batch       = 512
mai 17 11:51:54 urano ollama[1900]: llama_context: n_ubatch      = 512
mai 17 11:51:54 urano ollama[1900]: llama_context: causal_attn   = 1
mai 17 11:51:54 urano ollama[1900]: llama_context: flash_attn    = 1
mai 17 11:51:54 urano ollama[1900]: llama_context: freq_base     = 1000000.0
mai 17 11:51:54 urano ollama[1900]: llama_context: freq_scale    = 1
mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx_per_seq (32768) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
mai 17 11:51:54 urano ollama[1900]: set_abort_callback: call
mai 17 11:51:54 urano ollama[1900]: llama_context:  CUDA_Host  output buffer size =     0.60 MiB
mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx = 32768
mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx = 32768 (padded)
mai 17 11:51:54 urano ollama[1900]: init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
mai 17 11:51:54 urano ollama[1900]: init: layer   0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer   1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer   2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer   3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer   4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer   5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer   6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer   7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer   8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer   9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
mai 17 11:51:54 urano ollama[1900]: init: layer  22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  40: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  41: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  42: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  43: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1
mai 17 11:51:54 urano ollama[1900]: init: layer  44: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  45: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  46: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  47: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  48: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  49: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  50: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  51: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  52: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  53: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  54: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  55: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  56: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  57: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  58: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  59: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  60: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  61: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  62: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init: layer  63: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2
mai 17 11:51:54 urano ollama[1900]: init:      CUDA0 KV buffer size =  2816.00 MiB
mai 17 11:51:54 urano ollama[1900]: init:      CUDA1 KV buffer size =  2816.00 MiB
mai 17 11:51:54 urano ollama[1900]: init:      CUDA2 KV buffer size =  2560.00 MiB
mai 17 11:51:54 urano ollama[1900]: llama_context: KV self size  = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
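
As a sanity check (not part of the log), these KV buffer sizes follow directly from the parameters printed above: `kv_size = 32768`, `n_layer = 64`, `n_embd_k_gqa = n_embd_v_gqa = 1024`, and 2 bytes per f16 element:

```python
# Reproduce the KV cache sizes reported by the init/llama_context lines above.
n_ctx = 32768
n_layer = 64
n_embd_k_gqa = n_embd_v_gqa = 1024
bytes_per_elem = 2  # type_k = type_v = 'f16'

k_mib = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem / 2**20
v_mib = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem / 2**20
per_layer_mib = (k_mib + v_mib) / n_layer

print(k_mib, v_mib)        # 4096.0 4096.0 -> "KV self size = 8192.00 MiB"
print(per_layer_mib)       # 128.0 MiB per layer
print(22 * per_layer_mib)  # 2816.0 -> CUDA0 (layers 0-21), CUDA1 (layers 22-43)
print(20 * per_layer_mib)  # 2560.0 -> CUDA2 (layers 44-63)
```
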
mai 17 11:51:54 urano ollama[1900]: llama_context: enumerating backends
mai 17 11:51:54 urano ollama[1900]: llama_context: backend_ptrs.size() = 4
mai 17 11:51:54 urano ollama[1900]: llama_context: max_nodes = 65536
mai 17 11:51:54 urano ollama[1900]: llama_context: pipeline parallelism enabled (n_copies=4)
mai 17 11:51:54 urano ollama[1900]: llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
mai 17 11:51:54 urano ollama[1900]: llama_context: reserving graph for n_tokens = 512, n_seqs = 1
mai 17 11:51:54 urano ollama[1900]: time=2025-05-17T11:51:54.333-03:00 level=DEBUG source=server.go:634 msg="model load progress 1.00"
mai 17 11:51:54 urano ollama[1900]: llama_context: reserving graph for n_tokens = 1, n_seqs = 1
mai 17 11:51:54 urano ollama[1900]: llama_context: reserving graph for n_tokens = 512, n_seqs = 1
mai 17 11:51:54 urano ollama[1900]: llama_context:      CUDA0 compute buffer size =   470.01 MiB
mai 17 11:51:54 urano ollama[1900]: llama_context:      CUDA1 compute buffer size =   304.01 MiB
mai 17 11:51:54 urano ollama[1900]: llama_context:      CUDA2 compute buffer size =   474.77 MiB
mai 17 11:51:54 urano ollama[1900]: llama_context:  CUDA_Host compute buffer size =   266.02 MiB
mai 17 11:51:54 urano ollama[1900]: llama_context: graph nodes  = 2183
mai 17 11:51:54 urano ollama[1900]: llama_context: graph splits = 4
mai 17 11:51:54 urano ollama[1900]: time=2025-05-17T11:51:54.584-03:00 level=INFO source=server.go:628 msg="llama runner started in 5.77 seconds"
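
Summing the allocations the runner actually reports in this log (model + KV + compute buffers per GPU) gives a rough lower bound on real VRAM use. This is a sketch with figures copied from the log, not Ollama's own accounting, and it ignores CUDA context overhead, so tools like nvtop will show somewhat more:

```python
# Per-GPU allocations as reported by the runner in the log above (MiB).
model_mib   = [8392.68, 8392.68, 8238.30]  # load_tensors: CUDAx model buffer size
kv_mib      = [2816.00, 2816.00, 2560.00]  # init: CUDAx KV buffer size
compute_mib = [470.01, 304.01, 474.77]     # llama_context: CUDAx compute buffer size

per_gpu_mib = [m + k + c for m, k, c in zip(model_mib, kv_mib, compute_mib)]
print([round(x) for x in per_gpu_mib])    # [11679, 11513, 11273]
print(round(sum(per_gpu_mib) / 1024, 1))  # 33.7 GiB total across the three GPUs
```

That is roughly half of the ~22 GiB per GPU the scheduler's `memory.required.allocations` estimate reserves in these logs, which is consistent with the reported behavior: the scheduler spills to CPU long before the cards are actually full.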

VRAM usage (nvtop):

Image

<!-- gh-comment-id:2888440434 --> @johnnysn commented on GitHub (May 17, 2025):

It looks like I'm facing the same problem here, though it seems to be specific to Qwen3. My server has three RTX 3090s, and when I deploy Qwen3-32B-Q6_K with a 32k context size, the logs report a total required memory of 67.3 GiB even though the actual VRAM usage is about 36 GiB. If I increase the context size even slightly, Ollama starts assigning some layers to the CPU, which significantly hurts performance.

UPDATE: I just upgraded to version 0.7.0 and the behavior persists.

The logs:

```
gpu=GPU-87562afb-5b3f-53ed-67d0-0ca22728310e name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.1 GiB" now.total="23.6 GiB" now.free="23.1 GiB" now.used="459.5 MiB"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.453-03:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-71bb38d6-a9e1-a8fb-81a1-be53f9c998dd name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="276.9 MiB"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-db8680a0-d1b7-7702-914c-fab5322aaa6a name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="276.9 MiB"
mai 17 11:51:48 urano ollama[1900]: releasing cuda driver library
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split=22,22,21 memory.available="[23.3 GiB 23.3 GiB 23.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="67.3 GiB" memory.required.partial="67.3 GiB" memory.required.kv="8.0 GiB" memory.required.allocations="[22.6 GiB 22.7 GiB 22.1 GiB]" memory.weights.total="24.4 GiB" memory.weights.repeating="23.8 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="10.7 GiB" memory.graph.partial="10.7 GiB"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=INFO source=server.go:186 msg="enabling flash attention"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=WARN source=server.go:194 msg="kv cache type not supported by model" type=""
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.629-03:00 level=DEBUG source=server.go:263 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: loaded meta data with 28 key-value pairs and 707 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-c4c7c3cb6da260df1fe1d3cfd090a32dc7cc348f1278158be18e301f390d6f6e (version GGUF V3 (latest))
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 0: general.architecture str = qwen3
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 1: general.type str = model
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 2: general.name str = Qwen3 32B Instruct
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 3: general.finetune str = Instruct
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 4: general.basename str = Qwen3
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 5: general.size_label str = 32B
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 6: qwen3.block_count u32 = 64
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 7: qwen3.context_length u32 = 40960
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 8: qwen3.embedding_length u32 = 5120
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 25600
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 64
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 151643
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 26: general.quantization_version u32 = 2
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - kv 27: general.file_type u32 = 18
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - type f32: 257 tensors
mai 17 11:51:48 urano ollama[1900]: llama_model_loader: - type q6_K: 450 tensors
mai 17 11:51:48 urano ollama[1900]: print_info: file format = GGUF V3 (latest)
mai 17 11:51:48 urano ollama[1900]: print_info: file type = Q6_K
mai 17 11:51:48 urano ollama[1900]: print_info: file size = 25.03 GiB (6.56 BPW)
mai 17 11:51:48 urano ollama[1900]: init_tokenizer: initializing tokenizer for type 2
mai 17 11:51:48 urano ollama[1900]: load: control token: 151660 '<|fim_middle|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151653 '<|vision_end|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151648 '<|box_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151649 '<|box_end|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151655 '<|image_pad|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151651 '<|quad_end|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151652 '<|vision_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151654 '<|vision_pad|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151656 '<|video_pad|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151644 '<|im_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: control token: 151650 '<|quad_start|>' is not marked as EOG
mai 17 11:51:48 urano ollama[1900]: load: special tokens cache size = 26
mai 17 11:51:48 urano ollama[1900]: load: token to piece cache size = 0.9311 MB
mai 17 11:51:48 urano ollama[1900]: print_info: arch = qwen3
mai 17 11:51:48 urano ollama[1900]: print_info: vocab_only = 1
mai 17 11:51:48 urano ollama[1900]: print_info: model type = ?B
mai 17 11:51:48 urano ollama[1900]: print_info: model params = 32.76 B
mai 17 11:51:48 urano ollama[1900]: print_info: general.name = Qwen3 32B Instruct
mai 17 11:51:48 urano ollama[1900]: print_info: vocab type = BPE
mai 17 11:51:48 urano ollama[1900]: print_info: n_vocab = 151936
mai 17 11:51:48 urano ollama[1900]: print_info: n_merges = 151387
mai 17 11:51:48 urano ollama[1900]: print_info: BOS token = 151643 '<|endoftext|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOS token = 151645 '<|im_end|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOT token = 151645 '<|im_end|>'
mai 17 11:51:48 urano ollama[1900]: print_info: PAD token = 151643 '<|endoftext|>'
mai 17 11:51:48 urano ollama[1900]: print_info: LF token = 198 'Ċ'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM PRE token = 151659 '<|fim_prefix|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM SUF token = 151661 '<|fim_suffix|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM MID token = 151660 '<|fim_middle|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM PAD token = 151662 '<|fim_pad|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM REP token = 151663 '<|repo_name|>'
mai 17 11:51:48 urano ollama[1900]: print_info: FIM SEP token = 151664 '<|file_sep|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token = 151643 '<|endoftext|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token = 151645 '<|im_end|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token = 151662 '<|fim_pad|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token = 151663 '<|repo_name|>'
mai 17 11:51:48 urano ollama[1900]: print_info: EOG token = 151664 '<|file_sep|>'
mai 17 11:51:48 urano ollama[1900]: print_info: max token length = 256
mai 17 11:51:48 urano ollama[1900]: llama_model_load: vocab only - skipping tensors
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.813-03:00 level=DEBUG source=server.go:339 msg="adding gpu library" path=/usr/local/lib/ollama/cuda_v12
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.813-03:00 level=DEBUG source=server.go:346 msg="adding gpu dependency paths" paths=[/usr/local/lib/ollama/cuda_v12]
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.813-03:00 level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-c4c7c3cb6da260df1fe1d3cfd090a32dc7cc348f1278158be18e301f390d6f6e --ctx-size 32768 --batch-size 512 --n-gpu-layers 65 --verbose --threads 32 --flash-attn --parallel 1 --tensor-split 22,22,21 --port 43719"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.813-03:00 level=DEBUG source=server.go:429 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin OLLAMA_HOST=0.0.0.0 OLLAMA_KEEP_ALIVE=10m OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=true OLLAMA_MAX_LOADED_MODELS=9 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12 LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama:/usr/local/lib/ollama CUDA_VISIBLE_DEVICES=GPU-71bb38d6-a9e1-a8fb-81a1-be53f9c998dd,GPU-db8680a0-d1b7-7702-914c-fab5322aaa6a,GPU-87562afb-5b3f-53ed-67d0-0ca22728310e]"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.814-03:00 level=INFO source=sched.go:452 msg="loaded runners" count=1
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.814-03:00 level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.814-03:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.826-03:00 level=INFO source=runner.go:853 msg="starting go runner"
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.826-03:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=/usr/local/lib/ollama
mai 17 11:51:48 urano ollama[1900]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
mai 17 11:51:48 urano ollama[1900]: time=2025-05-17T11:51:48.830-03:00 level=DEBUG source=ggml.go:93 msg="ggml backend load all from path" path=/usr/local/lib/ollama/cuda_v12
mai 17 11:51:48 urano ollama[1900]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
mai 17 11:51:48 urano ollama[1900]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
mai 17 11:51:48 urano ollama[1900]: ggml_cuda_init: found 3 CUDA devices:
mai 17 11:51:48 urano ollama[1900]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
mai 17 11:51:48 urano ollama[1900]: Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
mai 17 11:51:48 urano ollama[1900]: Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
mai 17 11:51:49 urano ollama[1900]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
mai 17 11:51:49 urano ollama[1900]: time=2025-05-17T11:51:49.150-03:00 level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
mai 17 11:51:49 urano ollama[1900]: time=2025-05-17T11:51:49.151-03:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:43719"
mai 17 11:51:49 urano ollama[1900]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23873 MiB free
mai 17 11:51:49 urano ollama[1900]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23873 MiB free
mai 17 11:51:49 urano ollama[1900]: llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 3090) - 23657 MiB free
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: loaded meta data with 28 key-value pairs and 707 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-c4c7c3cb6da260df1fe1d3cfd090a32dc7cc348f1278158be18e301f390d6f6e (version GGUF V3 (latest))
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 0: general.architecture str = qwen3
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 1: general.type str = model
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 2: general.name str = Qwen3 32B Instruct
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 3: general.finetune str = Instruct
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 4: general.basename str = Qwen3
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 5: general.size_label str = 32B
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 6: qwen3.block_count u32 = 64
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 7: qwen3.context_length u32 = 40960
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 8: qwen3.embedding_length u32 = 5120
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 25600
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 64
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
mai 17 11:51:49 urano ollama[1900]: time=2025-05-17T11:51:49.316-03:00 level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
mai 17 11:51:49 urano
ollama[1900]: llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645 mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643 mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 151643 mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 26: general.quantization_version u32 = 2 mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - kv 27: general.file_type u32 = 18 mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - type f32: 257 tensors mai 17 11:51:49 urano ollama[1900]: llama_model_loader: - type q6_K: 450 tensors mai 17 11:51:49 urano ollama[1900]: print_info: file format = GGUF V3 (latest) mai 17 11:51:49 urano ollama[1900]: print_info: file type = Q6_K mai 17 11:51:49 urano ollama[1900]: print_info: file size = 25.03 GiB (6.56 BPW) mai 17 11:51:49 urano ollama[1900]: init_tokenizer: initializing tokenizer for type 2 mai 17 11:51:49 urano ollama[1900]: load: control token: 151660 '<|fim_middle|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151659 '<|fim_prefix|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151653 '<|vision_end|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control 
token: 151648 '<|box_start|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151646 '<|object_ref_start|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151649 '<|box_end|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151655 '<|image_pad|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151651 '<|quad_end|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151647 '<|object_ref_end|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151652 '<|vision_start|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151654 '<|vision_pad|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151656 '<|video_pad|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151644 '<|im_start|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151661 '<|fim_suffix|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: control token: 151650 '<|quad_start|>' is not marked as EOG mai 17 11:51:49 urano ollama[1900]: load: special tokens cache size = 26 mai 17 11:51:49 urano ollama[1900]: load: token to piece cache size = 0.9311 MB mai 17 11:51:49 urano ollama[1900]: print_info: arch = qwen3 mai 17 11:51:49 urano ollama[1900]: print_info: vocab_only = 0 mai 17 11:51:49 urano ollama[1900]: print_info: n_ctx_train = 40960 mai 17 11:51:49 urano ollama[1900]: print_info: n_embd = 5120 mai 17 11:51:49 urano ollama[1900]: print_info: n_layer = 64 mai 17 11:51:49 urano ollama[1900]: print_info: n_head = 64 mai 17 11:51:49 urano ollama[1900]: print_info: n_head_kv = 8 mai 17 11:51:49 urano ollama[1900]: print_info: n_rot = 128 mai 17 11:51:49 urano ollama[1900]: print_info: n_swa = 0 mai 17 11:51:49 urano ollama[1900]: print_info: n_swa_pattern = 1 mai 17 11:51:49 urano ollama[1900]: print_info: 
n_embd_head_k = 128 mai 17 11:51:49 urano ollama[1900]: print_info: n_embd_head_v = 128 mai 17 11:51:49 urano ollama[1900]: print_info: n_gqa = 8 mai 17 11:51:49 urano ollama[1900]: print_info: n_embd_k_gqa = 1024 mai 17 11:51:49 urano ollama[1900]: print_info: n_embd_v_gqa = 1024 mai 17 11:51:49 urano ollama[1900]: print_info: f_norm_eps = 0.0e+00 mai 17 11:51:49 urano ollama[1900]: print_info: f_norm_rms_eps = 1.0e-06 mai 17 11:51:49 urano ollama[1900]: print_info: f_clamp_kqv = 0.0e+00 mai 17 11:51:49 urano ollama[1900]: print_info: f_max_alibi_bias = 0.0e+00 mai 17 11:51:49 urano ollama[1900]: print_info: f_logit_scale = 0.0e+00 mai 17 11:51:49 urano ollama[1900]: print_info: f_attn_scale = 0.0e+00 mai 17 11:51:49 urano ollama[1900]: print_info: n_ff = 25600 mai 17 11:51:49 urano ollama[1900]: print_info: n_expert = 0 mai 17 11:51:49 urano ollama[1900]: print_info: n_expert_used = 0 mai 17 11:51:49 urano ollama[1900]: print_info: causal attn = 1 mai 17 11:51:49 urano ollama[1900]: print_info: pooling type = 0 mai 17 11:51:49 urano ollama[1900]: print_info: rope type = 2 mai 17 11:51:49 urano ollama[1900]: print_info: rope scaling = linear mai 17 11:51:49 urano ollama[1900]: print_info: freq_base_train = 1000000.0 mai 17 11:51:49 urano ollama[1900]: print_info: freq_scale_train = 1 mai 17 11:51:49 urano ollama[1900]: print_info: n_ctx_orig_yarn = 40960 mai 17 11:51:49 urano ollama[1900]: print_info: rope_finetuned = unknown mai 17 11:51:49 urano ollama[1900]: print_info: ssm_d_conv = 0 mai 17 11:51:49 urano ollama[1900]: print_info: ssm_d_inner = 0 mai 17 11:51:49 urano ollama[1900]: print_info: ssm_d_state = 0 mai 17 11:51:49 urano ollama[1900]: print_info: ssm_dt_rank = 0 mai 17 11:51:49 urano ollama[1900]: print_info: ssm_dt_b_c_rms = 0 mai 17 11:51:49 urano ollama[1900]: print_info: model type = 32B mai 17 11:51:49 urano ollama[1900]: print_info: model params = 32.76 B mai 17 11:51:49 urano ollama[1900]: print_info: general.name = Qwen3 32B Instruct mai 17 
11:51:49 urano ollama[1900]: print_info: vocab type = BPE mai 17 11:51:49 urano ollama[1900]: print_info: n_vocab = 151936 mai 17 11:51:49 urano ollama[1900]: print_info: n_merges = 151387 mai 17 11:51:49 urano ollama[1900]: print_info: BOS token = 151643 '<|endoftext|>' mai 17 11:51:49 urano ollama[1900]: print_info: EOS token = 151645 '<|im_end|>' mai 17 11:51:49 urano ollama[1900]: print_info: EOT token = 151645 '<|im_end|>' mai 17 11:51:49 urano ollama[1900]: print_info: PAD token = 151643 '<|endoftext|>' mai 17 11:51:49 urano ollama[1900]: print_info: LF token = 198 'Ċ' mai 17 11:51:49 urano ollama[1900]: print_info: FIM PRE token = 151659 '<|fim_prefix|>' mai 17 11:51:49 urano ollama[1900]: print_info: FIM SUF token = 151661 '<|fim_suffix|>' mai 17 11:51:49 urano ollama[1900]: print_info: FIM MID token = 151660 '<|fim_middle|>' mai 17 11:51:49 urano ollama[1900]: print_info: FIM PAD token = 151662 '<|fim_pad|>' mai 17 11:51:49 urano ollama[1900]: print_info: FIM REP token = 151663 '<|repo_name|>' mai 17 11:51:49 urano ollama[1900]: print_info: FIM SEP token = 151664 '<|file_sep|>' mai 17 11:51:49 urano ollama[1900]: print_info: EOG token = 151643 '<|endoftext|>' mai 17 11:51:49 urano ollama[1900]: print_info: EOG token = 151645 '<|im_end|>' mai 17 11:51:49 urano ollama[1900]: print_info: EOG token = 151662 '<|fim_pad|>' mai 17 11:51:49 urano ollama[1900]: print_info: EOG token = 151663 '<|repo_name|>' mai 17 11:51:49 urano ollama[1900]: print_info: EOG token = 151664 '<|file_sep|>' mai 17 11:51:49 urano ollama[1900]: print_info: max token length = 256 mai 17 11:51:49 urano ollama[1900]: load_tensors: loading model tensors, this can take a while... 
(mmap = true) mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 0 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 1 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 2 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 3 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 4 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 5 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 6 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 7 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 8 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 9 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 10 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 11 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 12 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 13 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 14 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 15 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 16 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 17 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 18 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 19 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 20 assigned to 
device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 21 assigned to device CUDA0, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 22 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 23 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 24 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 25 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 26 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 27 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 28 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 29 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 30 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 31 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 32 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 33 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 34 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 35 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 36 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 37 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 38 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 39 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 40 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: 
layer 41 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 42 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 43 assigned to device CUDA1, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 44 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 45 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 46 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 47 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 48 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 49 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 50 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 51 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 52 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 53 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 54 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 55 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 56 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 57 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 58 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 59 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 60 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 61 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano 
ollama[1900]: load_tensors: layer 62 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 63 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: layer 64 assigned to device CUDA2, is_swa = 0 mai 17 11:51:49 urano ollama[1900]: load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead mai 17 11:51:51 urano ollama[1900]: load_tensors: offloading 64 repeating layers to GPU mai 17 11:51:51 urano ollama[1900]: load_tensors: offloading output layer to GPU mai 17 11:51:51 urano ollama[1900]: load_tensors: offloaded 65/65 layers to GPU mai 17 11:51:51 urano ollama[1900]: load_tensors: CUDA0 model buffer size = 8392.68 MiB mai 17 11:51:51 urano ollama[1900]: load_tensors: CUDA1 model buffer size = 8392.68 MiB mai 17 11:51:51 urano ollama[1900]: load_tensors: CUDA2 model buffer size = 8238.30 MiB mai 17 11:51:51 urano ollama[1900]: load_tensors: CPU_Mapped model buffer size = 608.57 MiB mai 17 11:51:52 urano ollama[1900]: time=2025-05-17T11:51:52.076-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.11" mai 17 11:51:52 urano ollama[1900]: time=2025-05-17T11:51:52.327-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.28" mai 17 11:51:52 urano ollama[1900]: time=2025-05-17T11:51:52.578-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.45" mai 17 11:51:52 urano ollama[1900]: time=2025-05-17T11:51:52.828-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.63" mai 17 11:51:53 urano ollama[1900]: time=2025-05-17T11:51:53.079-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.80" mai 17 11:51:53 urano ollama[1900]: time=2025-05-17T11:51:53.330-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.97" mai 17 11:51:53 urano ollama[1900]: time=2025-05-17T11:51:53.581-03:00 level=DEBUG source=server.go:634 msg="model load progress 0.98" mai 17 
11:51:54 urano ollama[1900]: llama_context: constructing llama_context mai 17 11:51:54 urano ollama[1900]: llama_context: n_seq_max = 1 mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx = 32768 mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx_per_seq = 32768 mai 17 11:51:54 urano ollama[1900]: llama_context: n_batch = 512 mai 17 11:51:54 urano ollama[1900]: llama_context: n_ubatch = 512 mai 17 11:51:54 urano ollama[1900]: llama_context: causal_attn = 1 mai 17 11:51:54 urano ollama[1900]: llama_context: flash_attn = 1 mai 17 11:51:54 urano ollama[1900]: llama_context: freq_base = 1000000.0 mai 17 11:51:54 urano ollama[1900]: llama_context: freq_scale = 1 mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx_per_seq (32768) < n_ctx_train (40960) -- the full capacity of the model will not be utilized mai 17 11:51:54 urano ollama[1900]: set_abort_callback: call mai 17 11:51:54 urano ollama[1900]: llama_context: CUDA_Host output buffer size = 0.60 MiB mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx = 32768 mai 17 11:51:54 urano ollama[1900]: llama_context: n_ctx = 32768 (padded) mai 17 11:51:54 urano ollama[1900]: init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1 mai 17 11:51:54 urano ollama[1900]: init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 
11:51:54 urano ollama[1900]: init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0 mai 17 11:51:54 urano ollama[1900]: init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = 
CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 40: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 41: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 42: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 43: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA1 mai 17 11:51:54 urano ollama[1900]: init: layer 44: n_embd_k_gqa = 1024, n_embd_v_gqa 
= 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 45: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 46: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 47: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 48: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 49: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 50: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 51: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 52: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 53: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 54: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 55: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 56: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 57: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 58: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 59: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 60: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 61: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 62: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: layer 63: n_embd_k_gqa = 
1024, n_embd_v_gqa = 1024, dev = CUDA2 mai 17 11:51:54 urano ollama[1900]: init: CUDA0 KV buffer size = 2816.00 MiB mai 17 11:51:54 urano ollama[1900]: init: CUDA1 KV buffer size = 2816.00 MiB mai 17 11:51:54 urano ollama[1900]: init: CUDA2 KV buffer size = 2560.00 MiB mai 17 11:51:54 urano ollama[1900]: llama_context: KV self size = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB mai 17 11:51:54 urano ollama[1900]: llama_context: enumerating backends mai 17 11:51:54 urano ollama[1900]: llama_context: backend_ptrs.size() = 4 mai 17 11:51:54 urano ollama[1900]: llama_context: max_nodes = 65536 mai 17 11:51:54 urano ollama[1900]: llama_context: pipeline parallelism enabled (n_copies=4) mai 17 11:51:54 urano ollama[1900]: llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0 mai 17 11:51:54 urano ollama[1900]: llama_context: reserving graph for n_tokens = 512, n_seqs = 1 mai 17 11:51:54 urano ollama[1900]: time=2025-05-17T11:51:54.333-03:00 level=DEBUG source=server.go:634 msg="model load progress 1.00" mai 17 11:51:54 urano ollama[1900]: llama_context: reserving graph for n_tokens = 1, n_seqs = 1 mai 17 11:51:54 urano ollama[1900]: llama_context: reserving graph for n_tokens = 512, n_seqs = 1 mai 17 11:51:54 urano ollama[1900]: llama_context: CUDA0 compute buffer size = 470.01 MiB mai 17 11:51:54 urano ollama[1900]: llama_context: CUDA1 compute buffer size = 304.01 MiB mai 17 11:51:54 urano ollama[1900]: llama_context: CUDA2 compute buffer size = 474.77 MiB mai 17 11:51:54 urano ollama[1900]: llama_context: CUDA_Host compute buffer size = 266.02 MiB mai 17 11:51:54 urano ollama[1900]: llama_context: graph nodes = 2183 mai 17 11:51:54 urano ollama[1900]: llama_context: graph splits = 4 mai 17 11:51:54 urano ollama[1900]: time=2025-05-17T11:51:54.584-03:00 level=INFO source=server.go:628 msg="llama runner started in 5.77 seconds"
```

VRAM usage (nvtop):

![Image](https://github.com/user-attachments/assets/89f3944c-0243-46a6-b3b3-6049613cdfa7)
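As a sanity check, the KV-cache figure in the log above can be reproduced by hand from the model parameters it prints (`kv_size = 32768`, `n_layer = 64`, `n_embd_k_gqa = n_embd_v_gqa = 1024`, f16 = 2 bytes per element). This is plain arithmetic, not ollama's internal estimator:

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_kv, bytes_per_elem=2):
    """Size of an f16 KV cache in MiB: K and V each hold
    n_ctx * n_embd_kv elements per layer."""
    total_bytes = 2 * n_ctx * n_layer * n_embd_kv * bytes_per_elem
    return total_bytes / (1024 ** 2)

# Matches the log line "KV self size = 8192.00 MiB"
print(kv_cache_mib(32768, 64, 1024))  # 8192.0
```

So the KV cache itself is accounted for correctly here; the discrepancy reported in this issue is in how the total is split and scheduled, not in the KV math.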

@Desslar commented on GitHub (May 19, 2025):

Yes, same issue. 3x 3090 using about 50% of available VRAM and pushing the rest to system RAM. Not just Qwen 3 but also Llama 3.3 70B in my limited testing. I can barely fit 3.3 q4 with 2k context across 72GB of VRAM, whereas before I could fit 8K context with just 48GB.

I think ollama is not estimating VRAM usage correctly.
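A toy illustration of why an over-estimate causes this symptom (this is NOT ollama's actual scheduler code, just a sketch of the failure mode: a greedy layer-placement loop with an inflated per-layer cost runs out of "budget" early and leaves layers on the CPU despite free VRAM):

```python
def place_layers(n_layers, per_layer_mib, gpu_free_mib, overestimate=1.0):
    """Greedily place layers on GPUs; any layer that doesn't fit the
    (possibly inflated) estimate falls back to CPU."""
    est = per_layer_mib * overestimate
    budgets = list(gpu_free_mib)
    placed_gpu = 0
    for _ in range(n_layers):
        for i, free in enumerate(budgets):
            if free >= est:
                budgets[i] -= est
                placed_gpu += 1
                break
    return placed_gpu, n_layers - placed_gpu  # (GPU layers, CPU layers)

# 64 layers x 500 MiB on 3 x 16000 MiB fits fully:
print(place_layers(64, 500, [16000] * 3))        # (64, 0)
# ...but a 2x over-estimate pushes a quarter of them to the CPU:
print(place_layers(64, 500, [16000] * 3, 2.0))   # (48, 16)
```

In the second case real VRAM usage ends up at ~50% while the model still spills to system RAM, which is exactly the pattern reported in this thread.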


@woojh3690 commented on GitHub (May 19, 2025):

Same issue.


@mcelrath commented on GitHub (May 19, 2025):

Same issue observed with 3x 7900xtx. With e.g. qwen3:30b (Q4_K_M quantization), `ollama ps` reports 76 GB usage while `amdgpu_top` consistently reports about half that.

```
$ amdgpu_top -d | grep 'VRAM              :'; ollama ps
VRAM              : usage 12710 MiB, total 24560 MiB (usable 24525 MiB)
VRAM              : usage 11008 MiB, total 24560 MiB (usable 24525 MiB)
VRAM              : usage 10907 MiB, total 24560 MiB (usable 24525 MiB)
NAME         ID              SIZE     PROCESSOR    UNTIL
qwen3:30b    2ee832bc15b5    76 GB    100% GPU     Forever
```
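The gap is easy to quantify from the pasted output above; summing the three per-GPU `usage` figures (plain arithmetic on the quoted numbers):

```python
# Actual per-GPU VRAM usage reported by amdgpu_top in the output above,
# compared with the 76 GB figure from `ollama ps`.
used_mib = [12710, 11008, 10907]
actual_gib = sum(used_mib) / 1024
print(f"actual ~{actual_gib:.1f} GiB vs 76 GB reported by ollama ps")
```

About 33.8 GiB actually resident versus 76 GB claimed, i.e. `ollama ps` reports roughly double the real footprint.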

This is using Aider, which tends to request a ton of context for reasons I don't understand.


@konn-submarine-bu commented on GitHub (May 21, 2025):

same here, need bug fix!!!


@johnnysn commented on GitHub (May 24, 2025):

It seems like version 0.7.1 addresses some [memory-related issues](https://github.com/ollama/ollama/issues/10756), but I'm still seeing the same problem after updating from 0.7.0 to 0.7.1.

Could it be related to multi-GPU setups?

Ollama version:

![Image](https://github.com/user-attachments/assets/03804cf1-0f9d-41a7-acb4-613310104e02)

Ollama reports a 70GB VRAM usage for running qwen3:32b-q8_0 with a context window of 24k tokens

![Image](https://github.com/user-attachments/assets/28a6ddde-fc32-4d75-92c7-f88000bcc58a)

But the actual VRAM use is lower than 42GB:

![Image](https://github.com/user-attachments/assets/a753b0c1-b781-4e6f-8724-6699405dadf7)

I should be able to use a much longer context window with this setup, but if I increase it just a bit, ollama starts pushing layers to the system RAM.


@Fade78 commented on GitHub (May 28, 2025):

Yes, even 0.7.1 has the same problem. The memory estimate now seems to be correct, but ollama will still choose to distribute some of the load to the CPU even when VRAM usage is well below the available maximum...


@jessegross commented on GitHub (Jun 16, 2025):

There is an early preview of Ollama's new memory management, with the goal of comprehensively fixing these issues. It is still in development; however, if you want to compile from source and try it out, you can find it here: https://github.com/ollama/ollama/pull/11090

Please leave any feedback on that PR.


@jessegross commented on GitHub (Sep 24, 2025):

I'm going to go ahead and close this now that the new memory management logic is on by default. If you continue to see problems, please file a new issue.

Reference: github-starred/ollama#7054