[GH-ISSUE #12982] Ollama 0.12.9 Windows - 500: llama runner process has terminated: cudaMalloc failed: out of memory - granite4:small-h #55114

Closed
opened 2026-04-29 08:21:34 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @ghost on GitHub (Nov 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12982

What is the issue?

I'm having trouble running granite4:small-h. Is this because it's a new model and Ollama is not yet compatible with it?

Relevant log output

time=2025-11-06T14:08:46.942+11:00 level=INFO source=routes.go:1524 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\user\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-11-06T14:08:46.956+11:00 level=INFO source=images.go:522 msg="total blobs: 43"
time=2025-11-06T14:08:46.958+11:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0"
time=2025-11-06T14:08:46.959+11:00 level=INFO source=routes.go:1577 msg="Listening on 127.0.0.1:11434 (version 0.12.9)"
time=2025-11-06T14:08:46.961+11:00 level=INFO source=runner.go:76 msg="discovering available GPUs..."
time=2025-11-06T14:08:46.983+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51420"
time=2025-11-06T14:08:49.743+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51431"
time=2025-11-06T14:08:50.302+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51440"
time=2025-11-06T14:08:50.908+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51450"
time=2025-11-06T14:08:50.910+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51451"
time=2025-11-06T14:08:50.910+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51452"
time=2025-11-06T14:08:50.910+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51453"
time=2025-11-06T14:08:51.244+11:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-3f8ecd83-47d0-81a2-ed3c-5c41dd1507fc filtered_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3060" libdirs=ollama,cuda_v13 driver=13.0 pci_id=0000:15:00.0 type=discrete total="12.0 GiB" available="11.8 GiB"
time=2025-11-06T14:08:51.244+11:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-cbe06948-85e0-6899-f43f-21e1865c0283 filtered_id="" library=CUDA compute=8.6 name=CUDA1 description="NVIDIA GeForce RTX 3060" libdirs=ollama,cuda_v13 driver=13.0 pci_id=0000:21:00.0 type=discrete total="12.0 GiB" available="11.5 GiB"
[GIN] 2025/11/06 - 14:08:51 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/11/06 - 14:08:51 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/11/06 - 14:08:51 | 200 |      8.0922ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/11/06 - 14:08:51 | 404 |      9.2379ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/11/06 - 14:08:53 | 404 |      7.2052ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/11/06 - 14:08:56 | 200 |     10.9063ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/11/06 - 14:08:56 | 200 |       620.8µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/11/06 - 14:08:57 | 404 |      8.5511ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/11/06 - 14:09:21 | 200 |        62.1µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/11/06 - 14:09:21 | 200 |      9.8435ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/11/06 - 14:09:40 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/11/06 - 14:09:40 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/11/06 - 14:11:07 | 200 |     10.3175ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/11/06 - 14:11:07 | 200 |       546.5µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/11/06 - 14:11:12 | 200 |     10.2099ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/11/06 - 14:11:12 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/11/06 - 14:11:20 | 200 |      9.6292ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/11/06 - 14:11:20 | 200 |       541.9µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/11/06 - 14:11:23 | 200 |     11.8044ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/11/06 - 14:11:23 | 200 |       588.8µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/11/06 - 14:11:24 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
time=2025-11-06T14:11:29.257+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51503"
time=2025-11-06T14:11:29.821+11:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-11-06T14:11:29.821+11:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=6 efficiency=0 threads=12
llama_model_loader: loaded meta data with 43 key-value pairs and 666 tensors from C:\Users\user\.ollama\models\blobs\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = granitehybrid
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Granite 4.0 H Small
llama_model_loader: - kv   3:                           general.basename str              = granite-4.0-h
llama_model_loader: - kv   4:                         general.size_label str              = small
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["language", "granite-4.0"]
llama_model_loader: - kv   7:                  granitehybrid.block_count u32              = 40
llama_model_loader: - kv   8:               granitehybrid.context_length u32              = 1048576
llama_model_loader: - kv   9:             granitehybrid.embedding_length u32              = 4096
llama_model_loader: - kv  10:          granitehybrid.feed_forward_length u32              = 768
llama_model_loader: - kv  11:         granitehybrid.attention.head_count u32              = 32
llama_model_loader: - kv  12:      granitehybrid.attention.head_count_kv arr[i32,40]      = [0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv  13:               granitehybrid.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14: granitehybrid.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                 granitehybrid.expert_count u32              = 72
llama_model_loader: - kv  16:            granitehybrid.expert_used_count u32              = 10
llama_model_loader: - kv  17:                   granitehybrid.vocab_size u32              = 100352
llama_model_loader: - kv  18:         granitehybrid.rope.dimension_count u32              = 128
llama_model_loader: - kv  19:              granitehybrid.attention.scale f32              = 0.007813
llama_model_loader: - kv  20:              granitehybrid.embedding_scale f32              = 12.000000
llama_model_loader: - kv  21:               granitehybrid.residual_scale f32              = 0.220000
llama_model_loader: - kv  22:                  granitehybrid.logit_scale f32              = 16.000000
llama_model_loader: - kv  23: granitehybrid.expert_shared_feed_forward_length u32              = 1536
llama_model_loader: - kv  24:              granitehybrid.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  25:               granitehybrid.ssm.state_size u32              = 128
llama_model_loader: - kv  26:              granitehybrid.ssm.group_count u32              = 1
llama_model_loader: - kv  27:               granitehybrid.ssm.inner_size u32              = 8192
llama_model_loader: - kv  28:           granitehybrid.ssm.time_step_rank u32              = 128
llama_model_loader: - kv  29:       granitehybrid.rope.scaling.finetuned bool             = false
llama_model_loader: - kv  30:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  31:                         tokenizer.ggml.pre str              = dbrx
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  33:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  34:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  36:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 100269
llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 100256
llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  40:                    tokenizer.chat_template str              = {%- set tools_system_message_prefix =...
llama_model_loader: - kv  41:               general.quantization_version u32              = 2
llama_model_loader: - kv  42:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  337 tensors
llama_model_loader: - type q4_K:  286 tensors
llama_model_loader: - type q6_K:   43 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.14 GiB (4.84 BPW) 
load: printing all EOG tokens:
load:   - 100257 ('<|end_of_text|>')
load:   - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch             = granitehybrid
print_info: vocab_only       = 1
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_n_group      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 32.21 B
print_info: general.name     = Granite 4.0 H Small
print_info: f_embedding_scale = 0.000000
print_info: f_residual_scale  = 0.000000
print_info: f_attention_scale = 0.000000
print_info: n_ff_shexp        = 0
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100257 '<|end_of_text|>'
print_info: EOS token        = 100257 '<|end_of_text|>'
print_info: EOT token        = 100257 '<|end_of_text|>'
print_info: UNK token        = 100269 '<|unk|>'
print_info: PAD token        = 100256 '<|pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 100258 '<|fim_prefix|>'
print_info: FIM SUF token    = 100260 '<|fim_suffix|>'
print_info: FIM MID token    = 100259 '<|fim_middle|>'
print_info: FIM PAD token    = 100261 '<|fim_pad|>'
print_info: EOG token        = 100257 '<|end_of_text|>'
print_info: EOG token        = 100261 '<|fim_pad|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-11-06T14:11:30.106+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\user\\.ollama\\models\\blobs\\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 --port 51513"
time=2025-11-06T14:11:30.112+11:00 level=INFO source=server.go:470 msg="system memory" total="127.7 GiB" free="117.0 GiB" free_swap="134.4 GiB"
time=2025-11-06T14:11:30.113+11:00 level=INFO source=memory.go:37 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=C:\Users\user\.ollama\models\blobs\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 library=CUDA parallel=1 required="22.4 GiB" gpus=2
time=2025-11-06T14:11:30.114+11:00 level=INFO source=server.go:522 msg=offload library=CUDA layers.requested=-1 layers.model=41 layers.offload=41 layers.split="[21 20]" memory.available="[11.8 GiB 11.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.4 GiB" memory.required.partial="22.4 GiB" memory.required.kv="211.5 MiB" memory.required.allocations="[11.4 GiB 11.0 GiB]" memory.weights.total="18.1 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="321.6 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2025-11-06T14:11:30.183+11:00 level=INFO source=runner.go:910 msg="starting go runner"
load_backend: loaded CPU backend from C:\Users\user\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, ID: GPU-3f8ecd83-47d0-81a2-ed3c-5c41dd1507fc
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, ID: GPU-cbe06948-85e0-6899-f43f-21e1865c0283
load_backend: loaded CUDA backend from C:\Users\user\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2025-11-06T14:11:30.334+11:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-11-06T14:11:30.335+11:00 level=INFO source=runner.go:946 msg="Server listening on 127.0.0.1:51513"
time=2025-11-06T14:11:30.344+11:00 level=INFO source=runner.go:845 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:6 GPULayers:41[ID:GPU-3f8ecd83-47d0-81a2-ed3c-5c41dd1507fc Layers:21(0..20) ID:GPU-cbe06948-85e0-6899-f43f-21e1865c0283 Layers:20(21..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-06T14:11:30.345+11:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-11-06T14:11:30.345+11:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
ggml_backend_cuda_device_get_memory device GPU-3f8ecd83-47d0-81a2-ed3c-5c41dd1507fc utilizing NVML memory reporting free: 12689317888 total: 12884901888
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) (0000:15:00.0) - 12101 MiB free
ggml_backend_cuda_device_get_memory device GPU-cbe06948-85e0-6899-f43f-21e1865c0283 utilizing NVML memory reporting free: 12365746176 total: 12884901888
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3060) (0000:21:00.0) - 11792 MiB free
llama_model_loader: loaded meta data with 43 key-value pairs and 666 tensors from C:\Users\user\.ollama\models\blobs\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = granitehybrid
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Granite 4.0 H Small
llama_model_loader: - kv   3:                           general.basename str              = granite-4.0-h
llama_model_loader: - kv   4:                         general.size_label str              = small
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["language", "granite-4.0"]
llama_model_loader: - kv   7:                  granitehybrid.block_count u32              = 40
llama_model_loader: - kv   8:               granitehybrid.context_length u32              = 1048576
llama_model_loader: - kv   9:             granitehybrid.embedding_length u32              = 4096
llama_model_loader: - kv  10:          granitehybrid.feed_forward_length u32              = 768
llama_model_loader: - kv  11:         granitehybrid.attention.head_count u32              = 32
llama_model_loader: - kv  12:      granitehybrid.attention.head_count_kv arr[i32,40]      = [0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv  13:               granitehybrid.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14: granitehybrid.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                 granitehybrid.expert_count u32              = 72
llama_model_loader: - kv  16:            granitehybrid.expert_used_count u32              = 10
llama_model_loader: - kv  17:                   granitehybrid.vocab_size u32              = 100352
llama_model_loader: - kv  18:         granitehybrid.rope.dimension_count u32              = 128
llama_model_loader: - kv  19:              granitehybrid.attention.scale f32              = 0.007813
llama_model_loader: - kv  20:              granitehybrid.embedding_scale f32              = 12.000000
llama_model_loader: - kv  21:               granitehybrid.residual_scale f32              = 0.220000
llama_model_loader: - kv  22:                  granitehybrid.logit_scale f32              = 16.000000
llama_model_loader: - kv  23: granitehybrid.expert_shared_feed_forward_length u32              = 1536
llama_model_loader: - kv  24:              granitehybrid.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  25:               granitehybrid.ssm.state_size u32              = 128
llama_model_loader: - kv  26:              granitehybrid.ssm.group_count u32              = 1
llama_model_loader: - kv  27:               granitehybrid.ssm.inner_size u32              = 8192
llama_model_loader: - kv  28:           granitehybrid.ssm.time_step_rank u32              = 128
llama_model_loader: - kv  29:       granitehybrid.rope.scaling.finetuned bool             = false
llama_model_loader: - kv  30:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  31:                         tokenizer.ggml.pre str              = dbrx
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  33:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  34:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  36:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 100269
llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 100256
llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  40:                    tokenizer.chat_template str              = {%- set tools_system_message_prefix =...
llama_model_loader: - kv  41:               general.quantization_version u32              = 2
llama_model_loader: - kv  42:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  337 tensors
llama_model_loader: - type q4_K:  286 tensors
llama_model_loader: - type q6_K:   43 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.14 GiB (4.84 BPW) 
load: printing all EOG tokens:
load:   - 100257 ('<|end_of_text|>')
load:   - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch             = granitehybrid
print_info: vocab_only       = 0
print_info: n_ctx_train      = 1048576
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = [0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0]
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_embd_k_gqa     = [0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0]
print_info: n_embd_v_gqa     = [0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0]
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 1.6e+01
print_info: f_attn_scale     = 7.8e-03
print_info: n_ff             = 768
print_info: n_expert         = 72
print_info: n_expert_used    = 10
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 1048576
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 4
print_info: ssm_d_inner      = 8192
print_info: ssm_d_state      = 128
print_info: ssm_dt_rank      = 128
print_info: ssm_n_group      = 1
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 32.21 B
print_info: general.name     = Granite 4.0 H Small
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale  = 0.220000
print_info: f_attention_scale = 0.007813
print_info: n_ff_shexp        = 1536
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100257 '<|end_of_text|>'
print_info: EOS token        = 100257 '<|end_of_text|>'
print_info: EOT token        = 100257 '<|end_of_text|>'
print_info: UNK token        = 100269 '<|unk|>'
print_info: PAD token        = 100256 '<|pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 100258 '<|fim_prefix|>'
print_info: FIM SUF token    = 100260 '<|fim_suffix|>'
print_info: FIM MID token    = 100259 '<|fim_middle|>'
print_info: FIM PAD token    = 100261 '<|fim_pad|>'
print_info: EOG token        = 100257 '<|end_of_text|>'
print_info: EOG token        = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:        CUDA0 model buffer size =  9554.47 MiB
load_tensors:        CUDA1 model buffer size =  9016.47 MiB
load_tensors:          CPU model buffer size =   321.56 MiB
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.40 MiB
llama_kv_cache: the V embeddings have different sizes across layers and FA is not enabled - padding V cache to 1024
llama_kv_cache:      CUDA0 KV buffer size =    32.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    32.00 MiB
llama_kv_cache: size =   64.00 MiB (  4096 cells,   4 layers,  1/1 seqs), K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_memory_recurrent:      CUDA0 RS buffer size =    77.84 MiB
llama_memory_recurrent:      CUDA1 RS buffer size =    69.64 MiB
llama_memory_recurrent: size =  147.48 MiB (     1 cells,  40 layers,  1 seqs), R (f32):    3.48 MiB, S (f32):  144.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 3586.40 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 3760611840
graph_reserve: failed to allocate compute buffers
Exception 0xc0000005 0x0 0x139f3a5df58 0x7ffbb352001a
PC=0x7ffbb352001a
signal arrived during external code execution

runtime.cgocall(0x7ff7c9d8db90, 0xc00035fbf8)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/cgocall.go:167 +0x3e fp=0xc00035fbd0 sp=0xc00035fb68 pc=0x7ff7c9062c7e
github.com/ollama/ollama/llama._Cfunc_llama_init_from_model(0x13dfb2dfef0, {0x1000, 0x200, 0x200, 0x1, 0x6, 0x6, 0xffffffff, 0xffffffff, 0xffffffff, ...})
	_cgo_gotypes.go:741 +0x54 fp=0xc00035fbf8 sp=0xc00035fbd0 pc=0x7ff7c94315f4
github.com/ollama/ollama/llama.NewContextWithModel.func1(...)
	C:/a/ollama/ollama/llama/llama.go:280
github.com/ollama/ollama/llama.NewContextWithModel(0xc0002f0008, {{0x1000, 0x200, 0x200, 0x1, 0x6, 0x6, 0xffffffff, 0xffffffff, 0xffffffff, ...}})
	C:/a/ollama/ollama/llama/llama.go:280 +0x158 fp=0xc00035fd98 sp=0xc00035fbf8 pc=0x7ff7c9435738
github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc00031a640, {0x29, 0x0, 0x0, {0xc000611a58, 0x2, 0x2}, 0xc000421390, 0x0}, {0xc000038150, ...}, ...)
	C:/a/ollama/ollama/runner/llamarunner/runner.go:797 +0x198 fp=0xc00035fee0 sp=0xc00035fd98 pc=0x7ff7c94f4d18
github.com/ollama/ollama/runner/llamarunner.(*Server).load.gowrap2()
	C:/a/ollama/ollama/runner/llamarunner/runner.go:879 +0x175 fp=0xc00035ffe0 sp=0xc00035fee0 pc=0x7ff7c94f5db5
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00035ffe8 sp=0xc00035ffe0 pc=0x7ff7c906d8e1
created by github.com/ollama/ollama/runner/llamarunner.(*Server).load in goroutine 25
	C:/a/ollama/ollama/runner/llamarunner/runner.go:879 +0x7ce

goroutine 1 gp=0xc0000021c0 m=nil [IO wait]:
runtime.gopark(0x7ff7c906f0e0?, 0x7ff7caec5a60?, 0x20?, 0xc0?, 0xc0002fc0cc?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00050f648 sp=0xc00050f628 pc=0x7ff7c90661ce
runtime.netpollblock(0x3dc?, 0xc9000406?, 0xf7?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/netpoll.go:575 +0xf7 fp=0xc00050f680 sp=0xc00050f648 pc=0x7ff7c902bdf7
internal/poll.runtime_pollWait(0x13df8c60d70, 0x72)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/netpoll.go:351 +0x85 fp=0xc00050f6a0 sp=0xc00050f680 pc=0x7ff7c9065365
internal/poll.(*pollDesc).wait(0x7ff7c90f9e73?, 0x0?, 0x0)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00050f6c8 sp=0xc00050f6a0 pc=0x7ff7c90fb467
internal/poll.execIO(0xc0002fc020, 0xc00050f770)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:177 +0x105 fp=0xc00050f740 sp=0xc00050f6c8 pc=0x7ff7c90fc8c5
internal/poll.(*FD).acceptOne(0xc0002fc008, 0x3e8, {0xc0002d60f0?, 0xc00050f7d0?, 0x7ff7c9104585?}, 0xc00050f804?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:946 +0x65 fp=0xc00050f7a0 sp=0xc00050f740 pc=0x7ff7c9100e45
internal/poll.(*FD).Accept(0xc0002fc008, 0xc00050f950)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:980 +0x1b6 fp=0xc00050f858 sp=0xc00050f7a0 pc=0x7ff7c9101176
net.(*netFD).accept(0xc0002fc008)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/fd_windows.go:182 +0x4b fp=0xc00050f970 sp=0xc00050f858 pc=0x7ff7c917264b
net.(*TCPListener).accept(0xc0002ce100)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/tcpsock_posix.go:159 +0x1b fp=0xc00050f9c0 sp=0xc00050f970 pc=0x7ff7c918869b
net.(*TCPListener).Accept(0xc0002ce100)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/tcpsock.go:380 +0x30 fp=0xc00050f9f0 sp=0xc00050f9c0 pc=0x7ff7c9187450
net/http.(*onceCloseListener).Accept(0xc0002d4090?)
	<autogenerated>:1 +0x24 fp=0xc00050fa08 sp=0xc00050f9f0 pc=0x7ff7c93a0844
net/http.(*Server).Serve(0xc0001b8700, {0x7ff7ca534280, 0xc0002ce100})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:3424 +0x30c fp=0xc00050fb38 sp=0xc00050fa08 pc=0x7ff7c937810c
github.com/ollama/ollama/runner/llamarunner.Execute({0xc000118020, 0x4, 0x6})
	C:/a/ollama/ollama/runner/llamarunner/runner.go:947 +0x8f5 fp=0xc00050fd08 sp=0xc00050fb38 pc=0x7ff7c94f6775
github.com/ollama/ollama/runner.Execute({0xc000118010?, 0x0?, 0x0?})
	C:/a/ollama/ollama/runner/runner.go:22 +0xd4 fp=0xc00050fd30 sp=0xc00050fd08 pc=0x7ff7c9594a94
github.com/ollama/ollama/cmd.NewCLI.func2(0xc0000a8f00?, {0x7ff7ca34efb9?, 0x4?, 0x7ff7ca34efbd?})
	C:/a/ollama/ollama/cmd/cmd.go:1774 +0x45 fp=0xc00050fd58 sp=0xc00050fd30 pc=0x7ff7c9d1e485
github.com/spf13/cobra.(*Command).execute(0xc000477808, {0xc0002b6500, 0x4, 0x4})
	C:/Users/runneradmin/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc00050fe78 sp=0xc00050fd58 pc=0x7ff7c91ed11c
github.com/spf13/cobra.(*Command).ExecuteC(0xc000455208)
	C:/Users/runneradmin/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc00050ff30 sp=0xc00050fe78 pc=0x7ff7c91ed965
github.com/spf13/cobra.(*Command).Execute(...)
	C:/Users/runneradmin/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	C:/Users/runneradmin/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
	C:/a/ollama/ollama/main.go:12 +0x4d fp=0xc00050ff50 sp=0xc00050ff30 pc=0x7ff7c9d1ef4d
runtime.main()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:283 +0x27d fp=0xc00050ffe0 sp=0xc00050ff50 pc=0x7ff7c9034ddd
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00050ffe8 sp=0xc00050ffe0 pc=0x7ff7c906d8e1

goroutine 2 gp=0xc0000028c0 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00006dfa8 sp=0xc00006df88 pc=0x7ff7c90661ce
runtime.goparkunlock(...)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:441
runtime.forcegchelper()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:348 +0xb8 fp=0xc00006dfe0 sp=0xc00006dfa8 pc=0x7ff7c90350f8
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00006dfe8 sp=0xc00006dfe0 pc=0x7ff7c906d8e1
created by runtime.init.7 in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:336 +0x1a

goroutine 3 gp=0xc000002c40 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00006ff80 sp=0xc00006ff60 pc=0x7ff7c90661ce
runtime.goparkunlock(...)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:441
runtime.bgsweep(0xc00007c000)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgcsweep.go:316 +0xdf fp=0xc00006ffc8 sp=0xc00006ff80 pc=0x7ff7c901debf
runtime.gcenable.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:204 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x7ff7c9012285
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x7ff7c906d8e1
created by runtime.gcenable in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000002e00 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x7ff7ca520b28?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x7ff7c90661ce
runtime.goparkunlock(...)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:441
runtime.(*scavengerState).park(0x7ff7caeec3a0)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x7ff7c901b909
runtime.bgscavenge(0xc00007c000)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x7ff7c901be99
runtime.gcenable.gowrap2()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x7ff7c9012225
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x7ff7c906d8e1
created by runtime.gcenable in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:205 +0xa5

goroutine 5 gp=0xc000003340 m=nil [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000087e30 sp=0xc000087e10 pc=0x7ff7c90661ce
runtime.runfinq()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mfinal.go:196 +0x107 fp=0xc000087fe0 sp=0xc000087e30 pc=0x7ff7c9011207
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000087fe8 sp=0xc000087fe0 pc=0x7ff7c906d8e1
created by runtime.createfing in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mfinal.go:166 +0x3d

goroutine 6 gp=0xc000003dc0 m=nil [chan receive]:
runtime.gopark(0xc00014d860?, 0xc000588018?, 0x60?, 0x1f?, 0x7ff7c915b588?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000071f18 sp=0xc000071ef8 pc=0x7ff7c90661ce
runtime.chanrecv(0xc0000383f0, 0x0, 0x1)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/chan.go:664 +0x445 fp=0xc000071f90 sp=0xc000071f18 pc=0x7ff7c9002d45
runtime.chanrecv1(0x7ff7c9034f40?, 0xc000071f76?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/chan.go:506 +0x12 fp=0xc000071fb8 sp=0xc000071f90 pc=0x7ff7c90028d2
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1799 +0x2f fp=0xc000071fe0 sp=0xc000071fb8 pc=0x7ff7c90154af
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000071fe8 sp=0xc000071fe0 pc=0x7ff7c906d8e1
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1794 +0x85

goroutine 7 gp=0xc0003e0380 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000081f38 sp=0xc000081f18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000081fc8 sp=0xc000081f38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000081fe0 sp=0xc000081fc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000081fe8 sp=0xc000081fe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 18 gp=0xc00008e1c0 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00009bf38 sp=0xc00009bf18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc00009bfc8 sp=0xc00009bf38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc00009bfe0 sp=0xc00009bfc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00009bfe8 sp=0xc00009bfe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 34 gp=0xc000484000 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000097f38 sp=0xc000097f18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000097fc8 sp=0xc000097f38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000097fe0 sp=0xc000097fc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000097fe8 sp=0xc000097fe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 19 gp=0xc00008e380 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00009df38 sp=0xc00009df18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc00009dfc8 sp=0xc00009df38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc00009dfe0 sp=0xc00009dfc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00009dfe8 sp=0xc00009dfe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 20 gp=0xc00008e540 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc0000a5f38 sp=0xc0000a5f18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc0000a5fc8 sp=0xc0000a5f38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc0000a5fe0 sp=0xc0000a5fc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000a5fe8 sp=0xc0000a5fe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 8 gp=0xc0003e0540 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000083f38 sp=0xc000083f18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000083fc8 sp=0xc000083f38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000083fe0 sp=0xc000083fc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000083fe8 sp=0xc000083fe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 9 gp=0xc0003e0700 m=nil [GC worker (idle)]:
runtime.gopark(0x7f814d602e0?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc0000a1f38 sp=0xc0000a1f18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc0000a1fc8 sp=0xc0000a1f38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc0000a1fe0 sp=0xc0000a1fc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000a1fe8 sp=0xc0000a1fe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 35 gp=0xc0004841c0 m=nil [GC worker (idle)]:
runtime.gopark(0x7f814dea378?, 0x0?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000099f38 sp=0xc000099f18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000099fc8 sp=0xc000099f38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000099fe0 sp=0xc000099fc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000099fe8 sp=0xc000099fe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 36 gp=0xc000484380 m=nil [GC worker (idle)]:
runtime.gopark(0x7f814dea378?, 0x3?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00048bf38 sp=0xc00048bf18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc00048bfc8 sp=0xc00048bf38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc00048bfe0 sp=0xc00048bfc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00048bfe8 sp=0xc00048bfe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 21 gp=0xc00008e700 m=nil [GC worker (idle)]:
runtime.gopark(0x7ff7caf3af80?, 0x1?, 0x98?, 0xa0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc0000a7f38 sp=0xc0000a7f18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc0000a7fc8 sp=0xc0000a7f38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc0000a7fe0 sp=0xc0000a7fc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000a7fe8 sp=0xc0000a7fe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 22 gp=0xc00008e8c0 m=nil [GC worker (idle)]:
runtime.gopark(0x7f814dea378?, 0x3?, 0x0?, 0x0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000487f38 sp=0xc000487f18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000487fc8 sp=0xc000487f38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000487fe0 sp=0xc000487fc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000487fe8 sp=0xc000487fe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 23 gp=0xc00008ea80 m=nil [GC worker (idle)]:
runtime.gopark(0x7ff7caf3af80?, 0x1?, 0x98?, 0xa0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000489f38 sp=0xc000489f18 pc=0x7ff7c90661ce
runtime.gcBgMarkWorker(0xc000039810)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000489fc8 sp=0xc000489f38 pc=0x7ff7c90147a9
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000489fe0 sp=0xc000489fc8 pc=0x7ff7c9014685
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000489fe8 sp=0xc000489fe0 pc=0x7ff7c906d8e1
created by runtime.gcBgMarkStartWorkers in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105

goroutine 24 gp=0xc0003e0000 m=nil [sync.WaitGroup.Wait]:
runtime.gopark(0x0?, 0x0?, 0x60?, 0xe0?, 0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00048de20 sp=0xc00048de00 pc=0x7ff7c90661ce
runtime.goparkunlock(...)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:441
runtime.semacquire1(0xc00031a660, 0x0, 0x1, 0x0, 0x18)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/sema.go:188 +0x22f fp=0xc00048de88 sp=0xc00048de20 pc=0x7ff7c904750f
sync.runtime_SemacquireWaitGroup(0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/sema.go:110 +0x25 fp=0xc00048dec0 sp=0xc00048de88 pc=0x7ff7c90677c5
sync.(*WaitGroup).Wait(0x0?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/sync/waitgroup.go:118 +0x48 fp=0xc00048dee8 sp=0xc00048dec0 pc=0x7ff7c907b7a8
github.com/ollama/ollama/runner/llamarunner.(*Server).run(0xc00031a640, {0x7ff7ca5367f0, 0xc000148fa0})
	C:/a/ollama/ollama/runner/llamarunner/runner.go:334 +0x4b fp=0xc00048dfb8 sp=0xc00048dee8 pc=0x7ff7c94f18eb
github.com/ollama/ollama/runner/llamarunner.Execute.gowrap1()
	C:/a/ollama/ollama/runner/llamarunner/runner.go:926 +0x28 fp=0xc00048dfe0 sp=0xc00048dfb8 pc=0x7ff7c94f69e8
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00048dfe8 sp=0xc00048dfe0 pc=0x7ff7c906d8e1
created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
	C:/a/ollama/ollama/runner/llamarunner/runner.go:926 +0x4c5

goroutine 25 gp=0xc0003e01c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0xc0002fc2a0?, 0x48?, 0xc3?, 0xc0002fc34c?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc0000478c8 sp=0xc0000478a8 pc=0x7ff7c90661ce
runtime.netpollblock(0x3e4?, 0xc9000406?, 0xf7?)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/netpoll.go:575 +0xf7 fp=0xc000047900 sp=0xc0000478c8 pc=0x7ff7c902bdf7
internal/poll.runtime_pollWait(0x13df8c60c58, 0x72)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/netpoll.go:351 +0x85 fp=0xc000047920 sp=0xc000047900 pc=0x7ff7c9065365
internal/poll.(*pollDesc).wait(0x7ff7c922c0f7?, 0xc000047970?, 0x0)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000047948 sp=0xc000047920 pc=0x7ff7c90fb467
internal/poll.execIO(0xc0002fc2a0, 0x7ff7ca3c5c58)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:177 +0x105 fp=0xc0000479c0 sp=0xc000047948 pc=0x7ff7c90fc8c5
internal/poll.(*FD).Read(0xc0002fc288, {0xc0002d8000, 0x1000, 0x1000})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:438 +0x29b fp=0xc000047a60 sp=0xc0000479c0 pc=0x7ff7c90fd59b
net.(*netFD).Read(0xc0002fc288, {0xc0002d8000?, 0xc000047ad0?, 0x7ff7c90fb925?})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/fd_posix.go:55 +0x25 fp=0xc000047aa8 sp=0xc000047a60 pc=0x7ff7c9170765
net.(*conn).Read(0xc000074220, {0xc0002d8000?, 0x0?, 0x0?})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/net.go:194 +0x45 fp=0xc000047af0 sp=0xc000047aa8 pc=0x7ff7c917fc45
net/http.(*connReader).Read(0xc00023b290, {0xc0002d8000, 0x1000, 0x1000})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:798 +0x159 fp=0xc000047b40 sp=0xc000047af0 pc=0x7ff7c936cfb9
bufio.(*Reader).fill(0xc000090240)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/bufio/bufio.go:113 +0x103 fp=0xc000047b78 sp=0xc000047b40 pc=0x7ff7c9196483
bufio.(*Reader).Peek(0xc000090240, 0x4)
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/bufio/bufio.go:152 +0x53 fp=0xc000047b98 sp=0xc000047b78 pc=0x7ff7c91965b3
net/http.(*conn).serve(0xc0002d4090, {0x7ff7ca5367b8, 0xc00023b140})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:2137 +0x785 fp=0xc000047fb8 sp=0xc000047b98 pc=0x7ff7c9372da5
net/http.(*Server).Serve.gowrap3()
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:3454 +0x28 fp=0xc000047fe0 sp=0xc000047fb8 pc=0x7ff7c9378508
runtime.goexit({})
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000047fe8 sp=0xc000047fe0 pc=0x7ff7c906d8e1
created by net/http.(*Server).Serve in goroutine 1
	C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:3454 +0x485
rax     0xffffffff8cda7ff0
rbx     0x0
rcx     0x13d8cdb4df0
rdx     0x13d8cd1db20
rdi     0x13d8f20c140
rsi     0x13d8f20bf70
rbp     0x2b228ff790
rsp     0x2b228ff538
r8      0x13db3871080
r9      0x1
r10     0x8000
r11     0x2b228ff4a0
r12     0x13d8eec8e20
r13     0x13d8cdb54d0
r14     0x4
r15     0x14d8
rip     0x7ffbb352001a
rflags  0x10202
cs      0x33
fs      0x53
gs      0x2b
time=2025-11-06T14:11:36.649+11:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server error"
time=2025-11-06T14:11:36.683+11:00 level=ERROR source=server.go:273 msg="llama runner terminated" error="exit status 2"
time=2025-11-06T14:11:36.900+11:00 level=INFO source=sched.go:446 msg="Load failed" model=C:\Users\user\.ollama\models\blobs\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 error="llama runner process has terminated: cudaMalloc failed: out of memory"
[GIN] 2025/11/06 - 14:11:36 | 500 |    7.8442085s |       127.0.0.1 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.12.9

Originally created by @ghost on GitHub (Nov 6, 2025). Original GitHub issue: https://github.com/ollama/ollama/issues/12982 ### What is the issue? I'm having trouble running granite4:small-h. Is this because it's a new model and Ollama is not yet compatible with it? ### Relevant log output ```shell time=2025-11-06T14:08:46.942+11:00 level=INFO source=routes.go:1524 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\user\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" time=2025-11-06T14:08:46.956+11:00 level=INFO source=images.go:522 msg="total blobs: 43" time=2025-11-06T14:08:46.958+11:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0" time=2025-11-06T14:08:46.959+11:00 level=INFO source=routes.go:1577 msg="Listening on 127.0.0.1:11434 (version 0.12.9)" time=2025-11-06T14:08:46.961+11:00 level=INFO source=runner.go:76 msg="discovering available GPUs..." time=2025-11-06T14:08:46.983+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51420" time=2025-11-06T14:08:49.743+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51431" time=2025-11-06T14:08:50.302+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51440" time=2025-11-06T14:08:50.908+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51450" time=2025-11-06T14:08:50.910+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51451" time=2025-11-06T14:08:50.910+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51452" time=2025-11-06T14:08:50.910+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51453" time=2025-11-06T14:08:51.244+11:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-3f8ecd83-47d0-81a2-ed3c-5c41dd1507fc filtered_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3060" libdirs=ollama,cuda_v13 driver=13.0 pci_id=0000:15:00.0 type=discrete total="12.0 GiB" available="11.8 GiB" time=2025-11-06T14:08:51.244+11:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-cbe06948-85e0-6899-f43f-21e1865c0283 filtered_id="" library=CUDA compute=8.6 name=CUDA1 description="NVIDIA GeForce RTX 3060" libdirs=ollama,cuda_v13 driver=13.0 pci_id=0000:21:00.0 type=discrete total="12.0 GiB" available="11.5 GiB" [GIN] 2025/11/06 - 14:08:51 | 200 | 0s | 127.0.0.1 | GET "/api/version" [GIN] 2025/11/06 - 14:08:51 | 200 | 0s | 127.0.0.1 | GET "/api/version" [GIN] 2025/11/06 - 14:08:51 | 200 | 8.0922ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/11/06 - 14:08:51 | 404 | 9.2379ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/11/06 - 14:08:53 | 404 | 7.2052ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/11/06 - 14:08:56 | 200 | 10.9063ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/11/06 - 14:08:56 | 200 | 620.8µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/11/06 - 14:08:57 | 404 | 8.5511ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/11/06 - 14:09:21 | 200 | 62.1µs | 127.0.0.1 | GET "/api/version" [GIN] 2025/11/06 - 14:09:21 | 200 | 9.8435ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/11/06 - 14:09:40 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/11/06 - 14:09:40 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/11/06 - 14:11:07 | 200 | 10.3175ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/11/06 - 14:11:07 | 200 | 546.5µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/11/06 - 14:11:12 | 200 | 10.2099ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/11/06 - 14:11:12 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/11/06 - 14:11:20 | 200 | 9.6292ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/11/06 - 14:11:20 | 200 | 541.9µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/11/06 - 14:11:23 | 200 | 11.8044ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/11/06 - 14:11:23 | 200 | 588.8µs | 127.0.0.1 | GET "/api/ps" [GIN] 2025/11/06 - 14:11:24 | 200 | 0s | 127.0.0.1 | GET "/api/version" time=2025-11-06T14:11:29.257+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 51503" time=2025-11-06T14:11:29.821+11:00 level=INFO source=cpu_windows.go:139 msg=packages count=1 time=2025-11-06T14:11:29.821+11:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=6 efficiency=0 threads=12 llama_model_loader: loaded meta data with 43 key-value pairs and 666 tensors from C:\Users\user\.ollama\models\blobs\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = granitehybrid llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Granite 4.0 H Small llama_model_loader: - kv 3: general.basename str = granite-4.0-h llama_model_loader: - kv 4: general.size_label str = small llama_model_loader: - kv 5: general.license str = apache-2.0 llama_model_loader: - kv 6: general.tags arr[str,2] = ["language", "granite-4.0"] llama_model_loader: - kv 7: granitehybrid.block_count u32 = 40 llama_model_loader: - kv 8: granitehybrid.context_length u32 = 1048576 llama_model_loader: - kv 9: granitehybrid.embedding_length u32 = 4096 llama_model_loader: - kv 10: granitehybrid.feed_forward_length u32 = 768 llama_model_loader: - kv 11: granitehybrid.attention.head_count u32 = 32 llama_model_loader: - kv 12: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, ... llama_model_loader: - kv 13: granitehybrid.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 14: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 15: granitehybrid.expert_count u32 = 72 llama_model_loader: - kv 16: granitehybrid.expert_used_count u32 = 10 llama_model_loader: - kv 17: granitehybrid.vocab_size u32 = 100352 llama_model_loader: - kv 18: granitehybrid.rope.dimension_count u32 = 128 llama_model_loader: - kv 19: granitehybrid.attention.scale f32 = 0.007813 llama_model_loader: - kv 20: granitehybrid.embedding_scale f32 = 12.000000 llama_model_loader: - kv 21: granitehybrid.residual_scale f32 = 0.220000 llama_model_loader: - kv 22: granitehybrid.logit_scale f32 = 16.000000 llama_model_loader: - kv 23: granitehybrid.expert_shared_feed_forward_length u32 = 1536 llama_model_loader: - kv 24: granitehybrid.ssm.conv_kernel u32 = 4 llama_model_loader: - kv 25: granitehybrid.ssm.state_size u32 = 128 llama_model_loader: - kv 26: granitehybrid.ssm.group_count u32 = 1 llama_model_loader: - kv 27: granitehybrid.ssm.inner_size u32 = 8192 llama_model_loader: - kv 28: granitehybrid.ssm.time_step_rank u32 = 128 llama_model_loader: - kv 29: granitehybrid.rope.scaling.finetuned bool = false llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 31: tokenizer.ggml.pre str = dbrx llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 35: tokenizer.ggml.bos_token_id u32 = 100257 llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 100257 llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 100269 llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 100256 llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 40: tokenizer.chat_template str = {%- set tools_system_message_prefix =... llama_model_loader: - kv 41: general.quantization_version u32 = 2 llama_model_loader: - kv 42: general.file_type u32 = 15 llama_model_loader: - type f32: 337 tensors llama_model_loader: - type q4_K: 286 tensors llama_model_loader: - type q6_K: 43 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 18.14 GiB (4.84 BPW) load: printing all EOG tokens: load: - 100257 ('<|end_of_text|>') load: - 100261 ('<|fim_pad|>') load: special tokens cache size = 96 load: token to piece cache size = 0.6152 MB print_info: arch = granitehybrid print_info: vocab_only = 1 print_info: ssm_d_conv = 0 print_info: ssm_d_inner = 0 print_info: ssm_d_state = 0 print_info: ssm_dt_rank = 0 print_info: ssm_n_group = 0 print_info: ssm_dt_b_c_rms = 0 print_info: model type = ?B print_info: model params = 32.21 B print_info: general.name = Granite 4.0 H Small print_info: f_embedding_scale = 0.000000 print_info: f_residual_scale = 0.000000 print_info: f_attention_scale = 0.000000 print_info: n_ff_shexp = 0 print_info: vocab type = BPE print_info: n_vocab = 100352 print_info: n_merges = 100000 print_info: BOS token = 100257 '<|end_of_text|>' print_info: EOS token = 100257 '<|end_of_text|>' print_info: EOT token = 100257 '<|end_of_text|>' print_info: UNK token = 100269 '<|unk|>' print_info: PAD token = 100256 '<|pad|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 100258 '<|fim_prefix|>' print_info: FIM SUF token = 100260 '<|fim_suffix|>' print_info: FIM MID token = 100259 '<|fim_middle|>' print_info: FIM PAD token = 100261 '<|fim_pad|>' print_info: EOG token = 100257 '<|end_of_text|>' print_info: EOG token = 100261 '<|fim_pad|>' print_info: max token length = 256 llama_model_load: vocab only - skipping tensors time=2025-11-06T14:11:30.106+11:00 level=INFO source=server.go:400 msg="starting runner" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\user\\.ollama\\models\\blobs\\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 --port 51513" time=2025-11-06T14:11:30.112+11:00 level=INFO source=server.go:470 msg="system memory" total="127.7 GiB" free="117.0 GiB" free_swap="134.4 GiB" time=2025-11-06T14:11:30.113+11:00 level=INFO source=memory.go:37 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=C:\Users\user\.ollama\models\blobs\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 library=CUDA parallel=1 required="22.4 GiB" gpus=2 time=2025-11-06T14:11:30.114+11:00 level=INFO source=server.go:522 msg=offload library=CUDA layers.requested=-1 layers.model=41 layers.offload=41 layers.split="[21 20]" memory.available="[11.8 GiB 11.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.4 GiB" memory.required.partial="22.4 GiB" memory.required.kv="211.5 MiB" memory.required.allocations="[11.4 GiB 11.0 GiB]" memory.weights.total="18.1 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="321.6 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB" time=2025-11-06T14:11:30.183+11:00 level=INFO source=runner.go:910 msg="starting go runner" load_backend: loaded CPU backend from C:\Users\user\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, ID: GPU-3f8ecd83-47d0-81a2-ed3c-5c41dd1507fc Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, ID: GPU-cbe06948-85e0-6899-f43f-21e1865c0283 load_backend: loaded CUDA backend from C:\Users\user\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll time=2025-11-06T14:11:30.334+11:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2025-11-06T14:11:30.335+11:00 level=INFO source=runner.go:946 msg="Server listening on 127.0.0.1:51513" time=2025-11-06T14:11:30.344+11:00 level=INFO source=runner.go:845 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:6 GPULayers:41[ID:GPU-3f8ecd83-47d0-81a2-ed3c-5c41dd1507fc Layers:21(0..20) ID:GPU-cbe06948-85e0-6899-f43f-21e1865c0283 Layers:20(21..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-11-06T14:11:30.345+11:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" time=2025-11-06T14:11:30.345+11:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" ggml_backend_cuda_device_get_memory device GPU-3f8ecd83-47d0-81a2-ed3c-5c41dd1507fc utilizing NVML memory reporting free: 12689317888 total: 12884901888 llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) (0000:15:00.0) - 12101 MiB free ggml_backend_cuda_device_get_memory device GPU-cbe06948-85e0-6899-f43f-21e1865c0283 utilizing NVML memory reporting free: 12365746176 total: 12884901888 llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3060) (0000:21:00.0) - 11792 MiB free llama_model_loader: loaded meta data with 43 key-value pairs and 666 tensors from C:\Users\user\.ollama\models\blobs\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = granitehybrid llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Granite 4.0 H Small llama_model_loader: - kv 3: general.basename str = granite-4.0-h llama_model_loader: - kv 4: general.size_label str = small llama_model_loader: - kv 5: general.license str = apache-2.0 llama_model_loader: - kv 6: general.tags arr[str,2] = ["language", "granite-4.0"] llama_model_loader: - kv 7: granitehybrid.block_count u32 = 40 llama_model_loader: - kv 8: granitehybrid.context_length u32 = 1048576 llama_model_loader: - kv 9: granitehybrid.embedding_length u32 = 4096 llama_model_loader: - kv 10: granitehybrid.feed_forward_length u32 = 768 llama_model_loader: - kv 11: granitehybrid.attention.head_count u32 = 32 llama_model_loader: - kv 12: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, ... llama_model_loader: - kv 13: granitehybrid.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 14: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 15: granitehybrid.expert_count u32 = 72 llama_model_loader: - kv 16: granitehybrid.expert_used_count u32 = 10 llama_model_loader: - kv 17: granitehybrid.vocab_size u32 = 100352 llama_model_loader: - kv 18: granitehybrid.rope.dimension_count u32 = 128 llama_model_loader: - kv 19: granitehybrid.attention.scale f32 = 0.007813 llama_model_loader: - kv 20: granitehybrid.embedding_scale f32 = 12.000000 llama_model_loader: - kv 21: granitehybrid.residual_scale f32 = 0.220000 llama_model_loader: - kv 22: granitehybrid.logit_scale f32 = 16.000000 llama_model_loader: - kv 23: granitehybrid.expert_shared_feed_forward_length u32 = 1536 llama_model_loader: - kv 24: granitehybrid.ssm.conv_kernel u32 = 4 llama_model_loader: - kv 25: granitehybrid.ssm.state_size u32 = 128 llama_model_loader: - kv 26: granitehybrid.ssm.group_count u32 = 1 llama_model_loader: - kv 27: granitehybrid.ssm.inner_size u32 = 8192 llama_model_loader: - kv 28: granitehybrid.ssm.time_step_rank u32 = 128 llama_model_loader: - kv 29: granitehybrid.rope.scaling.finetuned bool = false llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 31: tokenizer.ggml.pre str = dbrx llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 35: tokenizer.ggml.bos_token_id u32 = 100257 llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 100257 llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 100269 llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 100256 llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 40: tokenizer.chat_template str = {%- set tools_system_message_prefix =... llama_model_loader: - kv 41: general.quantization_version u32 = 2 llama_model_loader: - kv 42: general.file_type u32 = 15 llama_model_loader: - type f32: 337 tensors llama_model_loader: - type q4_K: 286 tensors llama_model_loader: - type q6_K: 43 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 18.14 GiB (4.84 BPW) load: printing all EOG tokens: load: - 100257 ('<|end_of_text|>') load: - 100261 ('<|fim_pad|>') load: special tokens cache size = 96 load: token to piece cache size = 0.6152 MB print_info: arch = granitehybrid print_info: vocab_only = 0 print_info: n_ctx_train = 1048576 print_info: n_embd = 4096 print_info: n_layer = 40 print_info: n_head = 32 print_info: n_head_kv = [0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0] print_info: n_rot = 128 print_info: n_swa = 0 print_info: is_swa_any = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0] print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0] print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0] print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-05 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 1.6e+01 print_info: f_attn_scale = 7.8e-03 print_info: n_ff = 768 print_info: n_expert = 72 print_info: n_expert_used = 10 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 0 print_info: rope scaling = linear print_info: freq_base_train = 10000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 1048576 print_info: rope_finetuned = unknown print_info: ssm_d_conv = 4 print_info: ssm_d_inner = 8192 print_info: ssm_d_state = 128 print_info: ssm_dt_rank = 128 print_info: ssm_n_group = 1 print_info: ssm_dt_b_c_rms = 0 print_info: model type = ?B print_info: model params = 32.21 B print_info: general.name = Granite 4.0 H Small print_info: f_embedding_scale = 12.000000 print_info: f_residual_scale = 0.220000 print_info: f_attention_scale = 0.007813 print_info: n_ff_shexp = 1536 print_info: vocab type = BPE print_info: n_vocab = 100352 print_info: n_merges = 100000 print_info: BOS token = 100257 '<|end_of_text|>' print_info: EOS token = 100257 '<|end_of_text|>' print_info: EOT token = 100257 '<|end_of_text|>' print_info: UNK token = 100269 '<|unk|>' print_info: PAD token = 100256 '<|pad|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 100258 '<|fim_prefix|>' print_info: FIM SUF token = 100260 '<|fim_suffix|>' print_info: FIM MID token = 100259 '<|fim_middle|>' print_info: FIM PAD token = 100261 '<|fim_pad|>' print_info: EOG token = 100257 '<|end_of_text|>' print_info: EOG token = 100261 '<|fim_pad|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: offloading 40 repeating layers to GPU load_tensors: offloading output layer to GPU load_tensors: offloaded 41/41 layers to GPU load_tensors: CUDA0 model buffer size = 9554.47 MiB load_tensors: CUDA1 model buffer size = 9016.47 MiB load_tensors: CPU model buffer size = 321.56 MiB llama_init_from_model: model default pooling_type is [0], but [-1] was specified llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 4096 llama_context: n_ctx_per_seq = 4096 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = disabled llama_context: kv_unified = false llama_context: freq_base = 10000.0 llama_context: freq_scale = 1 llama_context: n_ctx_per_seq (4096) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized llama_context: CUDA_Host output buffer size = 0.40 MiB llama_kv_cache: the V embeddings have different sizes across layers and FA is not enabled - padding V cache to 1024 llama_kv_cache: CUDA0 KV buffer size = 32.00 MiB llama_kv_cache: CUDA1 KV buffer size = 32.00 MiB llama_kv_cache: size = 64.00 MiB ( 4096 cells, 4 layers, 1/1 seqs), K (f16): 32.00 MiB, V (f16): 32.00 MiB llama_memory_recurrent: CUDA0 RS buffer size = 77.84 MiB llama_memory_recurrent: CUDA1 RS buffer size = 69.64 MiB llama_memory_recurrent: size = 147.48 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 3.48 MiB, S (f32): 144.00 MiB llama_context: pipeline parallelism enabled (n_copies=4) ggml_backend_cuda_buffer_type_alloc_buffer: allocating 3586.40 MiB on device 1: cudaMalloc failed: out of memory ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 3760611840 graph_reserve: failed to allocate compute buffers Exception 0xc0000005 0x0 0x139f3a5df58 0x7ffbb352001a PC=0x7ffbb352001a signal arrived during external code execution runtime.cgocall(0x7ff7c9d8db90, 0xc00035fbf8) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/cgocall.go:167 +0x3e fp=0xc00035fbd0 sp=0xc00035fb68 pc=0x7ff7c9062c7e github.com/ollama/ollama/llama._Cfunc_llama_init_from_model(0x13dfb2dfef0, {0x1000, 0x200, 0x200, 0x1, 0x6, 0x6, 0xffffffff, 0xffffffff, 0xffffffff, ...}) _cgo_gotypes.go:741 +0x54 fp=0xc00035fbf8 sp=0xc00035fbd0 pc=0x7ff7c94315f4 github.com/ollama/ollama/llama.NewContextWithModel.func1(...) C:/a/ollama/ollama/llama/llama.go:280 github.com/ollama/ollama/llama.NewContextWithModel(0xc0002f0008, {{0x1000, 0x200, 0x200, 0x1, 0x6, 0x6, 0xffffffff, 0xffffffff, 0xffffffff, ...}}) C:/a/ollama/ollama/llama/llama.go:280 +0x158 fp=0xc00035fd98 sp=0xc00035fbf8 pc=0x7ff7c9435738 github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc00031a640, {0x29, 0x0, 0x0, {0xc000611a58, 0x2, 0x2}, 0xc000421390, 0x0}, {0xc000038150, ...}, ...) C:/a/ollama/ollama/runner/llamarunner/runner.go:797 +0x198 fp=0xc00035fee0 sp=0xc00035fd98 pc=0x7ff7c94f4d18 github.com/ollama/ollama/runner/llamarunner.(*Server).load.gowrap2() C:/a/ollama/ollama/runner/llamarunner/runner.go:879 +0x175 fp=0xc00035ffe0 sp=0xc00035fee0 pc=0x7ff7c94f5db5 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00035ffe8 sp=0xc00035ffe0 pc=0x7ff7c906d8e1 created by github.com/ollama/ollama/runner/llamarunner.(*Server).load in goroutine 25 C:/a/ollama/ollama/runner/llamarunner/runner.go:879 +0x7ce goroutine 1 gp=0xc0000021c0 m=nil [IO wait]: runtime.gopark(0x7ff7c906f0e0?, 0x7ff7caec5a60?, 0x20?, 0xc0?, 0xc0002fc0cc?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00050f648 sp=0xc00050f628 pc=0x7ff7c90661ce runtime.netpollblock(0x3dc?, 0xc9000406?, 0xf7?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/netpoll.go:575 +0xf7 fp=0xc00050f680 sp=0xc00050f648 pc=0x7ff7c902bdf7 internal/poll.runtime_pollWait(0x13df8c60d70, 0x72) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/netpoll.go:351 +0x85 fp=0xc00050f6a0 sp=0xc00050f680 pc=0x7ff7c9065365 internal/poll.(*pollDesc).wait(0x7ff7c90f9e73?, 0x0?, 0x0) C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00050f6c8 sp=0xc00050f6a0 pc=0x7ff7c90fb467 internal/poll.execIO(0xc0002fc020, 0xc00050f770) C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:177 +0x105 fp=0xc00050f740 sp=0xc00050f6c8 pc=0x7ff7c90fc8c5 internal/poll.(*FD).acceptOne(0xc0002fc008, 0x3e8, {0xc0002d60f0?, 0xc00050f7d0?, 0x7ff7c9104585?}, 0xc00050f804?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:946 +0x65 fp=0xc00050f7a0 sp=0xc00050f740 pc=0x7ff7c9100e45 internal/poll.(*FD).Accept(0xc0002fc008, 0xc00050f950) C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:980 +0x1b6 fp=0xc00050f858 sp=0xc00050f7a0 pc=0x7ff7c9101176 net.(*netFD).accept(0xc0002fc008) C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/fd_windows.go:182 +0x4b fp=0xc00050f970 sp=0xc00050f858 pc=0x7ff7c917264b net.(*TCPListener).accept(0xc0002ce100) C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/tcpsock_posix.go:159 +0x1b fp=0xc00050f9c0 sp=0xc00050f970 pc=0x7ff7c918869b net.(*TCPListener).Accept(0xc0002ce100) C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/tcpsock.go:380 +0x30 fp=0xc00050f9f0 sp=0xc00050f9c0 pc=0x7ff7c9187450 net/http.(*onceCloseListener).Accept(0xc0002d4090?) <autogenerated>:1 +0x24 fp=0xc00050fa08 sp=0xc00050f9f0 pc=0x7ff7c93a0844 net/http.(*Server).Serve(0xc0001b8700, {0x7ff7ca534280, 0xc0002ce100}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:3424 +0x30c fp=0xc00050fb38 sp=0xc00050fa08 pc=0x7ff7c937810c github.com/ollama/ollama/runner/llamarunner.Execute({0xc000118020, 0x4, 0x6}) C:/a/ollama/ollama/runner/llamarunner/runner.go:947 +0x8f5 fp=0xc00050fd08 sp=0xc00050fb38 pc=0x7ff7c94f6775 github.com/ollama/ollama/runner.Execute({0xc000118010?, 0x0?, 0x0?}) C:/a/ollama/ollama/runner/runner.go:22 +0xd4 fp=0xc00050fd30 sp=0xc00050fd08 pc=0x7ff7c9594a94 github.com/ollama/ollama/cmd.NewCLI.func2(0xc0000a8f00?, {0x7ff7ca34efb9?, 0x4?, 0x7ff7ca34efbd?}) C:/a/ollama/ollama/cmd/cmd.go:1774 +0x45 fp=0xc00050fd58 sp=0xc00050fd30 pc=0x7ff7c9d1e485 github.com/spf13/cobra.(*Command).execute(0xc000477808, {0xc0002b6500, 0x4, 0x4}) C:/Users/runneradmin/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc00050fe78 sp=0xc00050fd58 pc=0x7ff7c91ed11c github.com/spf13/cobra.(*Command).ExecuteC(0xc000455208) C:/Users/runneradmin/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc00050ff30 sp=0xc00050fe78 pc=0x7ff7c91ed965 github.com/spf13/cobra.(*Command).Execute(...) C:/Users/runneradmin/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992 github.com/spf13/cobra.(*Command).ExecuteContext(...) C:/Users/runneradmin/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:985 main.main() C:/a/ollama/ollama/main.go:12 +0x4d fp=0xc00050ff50 sp=0xc00050ff30 pc=0x7ff7c9d1ef4d runtime.main() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:283 +0x27d fp=0xc00050ffe0 sp=0xc00050ff50 pc=0x7ff7c9034ddd runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00050ffe8 sp=0xc00050ffe0 pc=0x7ff7c906d8e1 goroutine 2 gp=0xc0000028c0 m=nil [force gc (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00006dfa8 sp=0xc00006df88 pc=0x7ff7c90661ce runtime.goparkunlock(...) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:441 runtime.forcegchelper() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:348 +0xb8 fp=0xc00006dfe0 sp=0xc00006dfa8 pc=0x7ff7c90350f8 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00006dfe8 sp=0xc00006dfe0 pc=0x7ff7c906d8e1 created by runtime.init.7 in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:336 +0x1a goroutine 3 gp=0xc000002c40 m=nil [GC sweep wait]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00006ff80 sp=0xc00006ff60 pc=0x7ff7c90661ce runtime.goparkunlock(...) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:441 runtime.bgsweep(0xc00007c000) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgcsweep.go:316 +0xdf fp=0xc00006ffc8 sp=0xc00006ff80 pc=0x7ff7c901debf runtime.gcenable.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:204 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x7ff7c9012285 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x7ff7c906d8e1 created by runtime.gcenable in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:204 +0x66 goroutine 4 gp=0xc000002e00 m=nil [GC scavenge wait]: runtime.gopark(0x10000?, 0x7ff7ca520b28?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x7ff7c90661ce runtime.goparkunlock(...) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:441 runtime.(*scavengerState).park(0x7ff7caeec3a0) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x7ff7c901b909 runtime.bgscavenge(0xc00007c000) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x7ff7c901be99 runtime.gcenable.gowrap2() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x7ff7c9012225 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x7ff7c906d8e1 created by runtime.gcenable in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:205 +0xa5 goroutine 5 gp=0xc000003340 m=nil [finalizer wait]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000087e30 sp=0xc000087e10 pc=0x7ff7c90661ce runtime.runfinq() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mfinal.go:196 +0x107 fp=0xc000087fe0 sp=0xc000087e30 pc=0x7ff7c9011207 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000087fe8 sp=0xc000087fe0 pc=0x7ff7c906d8e1 created by runtime.createfing in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mfinal.go:166 +0x3d goroutine 6 gp=0xc000003dc0 m=nil [chan receive]: runtime.gopark(0xc00014d860?, 0xc000588018?, 0x60?, 0x1f?, 0x7ff7c915b588?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000071f18 sp=0xc000071ef8 pc=0x7ff7c90661ce runtime.chanrecv(0xc0000383f0, 0x0, 0x1) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/chan.go:664 +0x445 fp=0xc000071f90 sp=0xc000071f18 pc=0x7ff7c9002d45 runtime.chanrecv1(0x7ff7c9034f40?, 0xc000071f76?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/chan.go:506 +0x12 fp=0xc000071fb8 sp=0xc000071f90 pc=0x7ff7c90028d2 runtime.unique_runtime_registerUniqueMapCleanup.func2(...) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1796 runtime.unique_runtime_registerUniqueMapCleanup.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1799 +0x2f fp=0xc000071fe0 sp=0xc000071fb8 pc=0x7ff7c90154af runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000071fe8 sp=0xc000071fe0 pc=0x7ff7c906d8e1 created by unique.runtime_registerUniqueMapCleanup in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1794 +0x85 goroutine 7 gp=0xc0003e0380 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000081f38 sp=0xc000081f18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000081fc8 sp=0xc000081f38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000081fe0 sp=0xc000081fc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000081fe8 sp=0xc000081fe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 18 gp=0xc00008e1c0 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00009bf38 sp=0xc00009bf18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc00009bfc8 sp=0xc00009bf38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc00009bfe0 sp=0xc00009bfc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00009bfe8 sp=0xc00009bfe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 34 gp=0xc000484000 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000097f38 sp=0xc000097f18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000097fc8 sp=0xc000097f38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000097fe0 sp=0xc000097fc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000097fe8 sp=0xc000097fe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 19 gp=0xc00008e380 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00009df38 sp=0xc00009df18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc00009dfc8 sp=0xc00009df38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc00009dfe0 sp=0xc00009dfc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00009dfe8 sp=0xc00009dfe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 20 gp=0xc00008e540 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc0000a5f38 sp=0xc0000a5f18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc0000a5fc8 sp=0xc0000a5f38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc0000a5fe0 sp=0xc0000a5fc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000a5fe8 sp=0xc0000a5fe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 8 gp=0xc0003e0540 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000083f38 sp=0xc000083f18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000083fc8 sp=0xc000083f38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000083fe0 sp=0xc000083fc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000083fe8 sp=0xc000083fe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 9 gp=0xc0003e0700 m=nil [GC worker (idle)]: runtime.gopark(0x7f814d602e0?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc0000a1f38 sp=0xc0000a1f18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc0000a1fc8 sp=0xc0000a1f38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc0000a1fe0 sp=0xc0000a1fc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000a1fe8 sp=0xc0000a1fe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 35 gp=0xc0004841c0 m=nil [GC worker (idle)]: runtime.gopark(0x7f814dea378?, 0x0?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000099f38 sp=0xc000099f18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000099fc8 sp=0xc000099f38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000099fe0 sp=0xc000099fc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000099fe8 sp=0xc000099fe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 36 gp=0xc000484380 m=nil [GC worker (idle)]: runtime.gopark(0x7f814dea378?, 0x3?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00048bf38 sp=0xc00048bf18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc00048bfc8 sp=0xc00048bf38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc00048bfe0 sp=0xc00048bfc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00048bfe8 sp=0xc00048bfe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 21 gp=0xc00008e700 m=nil [GC worker (idle)]: runtime.gopark(0x7ff7caf3af80?, 0x1?, 0x98?, 0xa0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc0000a7f38 sp=0xc0000a7f18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc0000a7fc8 sp=0xc0000a7f38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc0000a7fe0 sp=0xc0000a7fc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000a7fe8 sp=0xc0000a7fe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 22 gp=0xc00008e8c0 m=nil [GC worker (idle)]: runtime.gopark(0x7f814dea378?, 0x3?, 0x0?, 0x0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000487f38 sp=0xc000487f18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000487fc8 sp=0xc000487f38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000487fe0 sp=0xc000487fc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000487fe8 sp=0xc000487fe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 23 gp=0xc00008ea80 m=nil [GC worker (idle)]: runtime.gopark(0x7ff7caf3af80?, 0x1?, 0x98?, 0xa0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc000489f38 sp=0xc000489f18 pc=0x7ff7c90661ce runtime.gcBgMarkWorker(0xc000039810) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1423 +0xe9 fp=0xc000489fc8 sp=0xc000489f38 pc=0x7ff7c90147a9 runtime.gcBgMarkStartWorkers.gowrap1() C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x25 fp=0xc000489fe0 sp=0xc000489fc8 pc=0x7ff7c9014685 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000489fe8 sp=0xc000489fe0 pc=0x7ff7c906d8e1 created by runtime.gcBgMarkStartWorkers in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/mgc.go:1339 +0x105 goroutine 24 gp=0xc0003e0000 m=nil [sync.WaitGroup.Wait]: runtime.gopark(0x0?, 0x0?, 0x60?, 0xe0?, 0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc00048de20 sp=0xc00048de00 pc=0x7ff7c90661ce runtime.goparkunlock(...) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:441 runtime.semacquire1(0xc00031a660, 0x0, 0x1, 0x0, 0x18) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/sema.go:188 +0x22f fp=0xc00048de88 sp=0xc00048de20 pc=0x7ff7c904750f sync.runtime_SemacquireWaitGroup(0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/sema.go:110 +0x25 fp=0xc00048dec0 sp=0xc00048de88 pc=0x7ff7c90677c5 sync.(*WaitGroup).Wait(0x0?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/sync/waitgroup.go:118 +0x48 fp=0xc00048dee8 sp=0xc00048dec0 pc=0x7ff7c907b7a8 github.com/ollama/ollama/runner/llamarunner.(*Server).run(0xc00031a640, {0x7ff7ca5367f0, 0xc000148fa0}) C:/a/ollama/ollama/runner/llamarunner/runner.go:334 +0x4b fp=0xc00048dfb8 sp=0xc00048dee8 pc=0x7ff7c94f18eb github.com/ollama/ollama/runner/llamarunner.Execute.gowrap1() C:/a/ollama/ollama/runner/llamarunner/runner.go:926 +0x28 fp=0xc00048dfe0 sp=0xc00048dfb8 pc=0x7ff7c94f69e8 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00048dfe8 sp=0xc00048dfe0 pc=0x7ff7c906d8e1 created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1 C:/a/ollama/ollama/runner/llamarunner/runner.go:926 +0x4c5 goroutine 25 gp=0xc0003e01c0 m=nil [IO wait]: runtime.gopark(0x0?, 0xc0002fc2a0?, 0x48?, 0xc3?, 0xc0002fc34c?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/proc.go:435 +0xce fp=0xc0000478c8 sp=0xc0000478a8 pc=0x7ff7c90661ce runtime.netpollblock(0x3e4?, 0xc9000406?, 0xf7?) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/netpoll.go:575 +0xf7 fp=0xc000047900 sp=0xc0000478c8 pc=0x7ff7c902bdf7 internal/poll.runtime_pollWait(0x13df8c60c58, 0x72) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/netpoll.go:351 +0x85 fp=0xc000047920 sp=0xc000047900 pc=0x7ff7c9065365 internal/poll.(*pollDesc).wait(0x7ff7c922c0f7?, 0xc000047970?, 0x0) C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000047948 sp=0xc000047920 pc=0x7ff7c90fb467 internal/poll.execIO(0xc0002fc2a0, 0x7ff7ca3c5c58) C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:177 +0x105 fp=0xc0000479c0 sp=0xc000047948 pc=0x7ff7c90fc8c5 internal/poll.(*FD).Read(0xc0002fc288, {0xc0002d8000, 0x1000, 0x1000}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/internal/poll/fd_windows.go:438 +0x29b fp=0xc000047a60 sp=0xc0000479c0 pc=0x7ff7c90fd59b net.(*netFD).Read(0xc0002fc288, {0xc0002d8000?, 0xc000047ad0?, 0x7ff7c90fb925?}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/fd_posix.go:55 +0x25 fp=0xc000047aa8 sp=0xc000047a60 pc=0x7ff7c9170765 net.(*conn).Read(0xc000074220, {0xc0002d8000?, 0x0?, 0x0?}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/net.go:194 +0x45 fp=0xc000047af0 sp=0xc000047aa8 pc=0x7ff7c917fc45 net/http.(*connReader).Read(0xc00023b290, {0xc0002d8000, 0x1000, 0x1000}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:798 +0x159 fp=0xc000047b40 sp=0xc000047af0 pc=0x7ff7c936cfb9 bufio.(*Reader).fill(0xc000090240) C:/hostedtoolcache/windows/go/1.24.0/x64/src/bufio/bufio.go:113 +0x103 fp=0xc000047b78 sp=0xc000047b40 pc=0x7ff7c9196483 bufio.(*Reader).Peek(0xc000090240, 0x4) C:/hostedtoolcache/windows/go/1.24.0/x64/src/bufio/bufio.go:152 +0x53 fp=0xc000047b98 sp=0xc000047b78 pc=0x7ff7c91965b3 net/http.(*conn).serve(0xc0002d4090, {0x7ff7ca5367b8, 0xc00023b140}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:2137 +0x785 fp=0xc000047fb8 sp=0xc000047b98 pc=0x7ff7c9372da5 net/http.(*Server).Serve.gowrap3() C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:3454 +0x28 fp=0xc000047fe0 sp=0xc000047fb8 pc=0x7ff7c9378508 runtime.goexit({}) C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000047fe8 sp=0xc000047fe0 pc=0x7ff7c906d8e1 created by net/http.(*Server).Serve in goroutine 1 C:/hostedtoolcache/windows/go/1.24.0/x64/src/net/http/server.go:3454 +0x485 rax 0xffffffff8cda7ff0 rbx 0x0 rcx 0x13d8cdb4df0 rdx 0x13d8cd1db20 rdi 0x13d8f20c140 rsi 0x13d8f20bf70 rbp 0x2b228ff790 rsp 0x2b228ff538 r8 0x13db3871080 r9 0x1 r10 0x8000 r11 0x2b228ff4a0 r12 0x13d8eec8e20 r13 0x13d8cdb54d0 r14 0x4 r15 0x14d8 rip 0x7ffbb352001a rflags 0x10202 cs 0x33 fs 0x53 gs 0x2b time=2025-11-06T14:11:36.649+11:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server error" time=2025-11-06T14:11:36.683+11:00 level=ERROR source=server.go:273 msg="llama runner terminated" error="exit status 2" time=2025-11-06T14:11:36.900+11:00 level=INFO source=sched.go:446 msg="Load failed" model=C:\Users\user\.ollama\models\blobs\sha256-23c6019f2e6fa56eb8872ecf8d9d1e4b2ebf3c087fb5275257ea07602d2a38b6 error="llama runner process has terminated: cudaMalloc failed: out of memory" [GIN] 2025/11/06 - 14:11:36 | 500 | 7.8442085s | 127.0.0.1 | POST "/api/chat" ``` ### OS Windows ### GPU Nvidia ### CPU Intel ### Ollama version 0.12.9
GiteaMirror added the bug label 2026-04-29 08:21:34 -05:00
Author
Owner

@rick-github commented on GitHub (Nov 6, 2025):

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 3586.40 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 3760611840
graph_reserve: failed to allocate compute buffers

granite is running on the old engine which is not as accurate with memory estimation as the new ollama engine. Until the model is migrated, you can mitigate OOM issues using some of the methods show here.

<!-- gh-comment-id:3497669357 --> @rick-github commented on GitHub (Nov 6, 2025): ``` ggml_backend_cuda_buffer_type_alloc_buffer: allocating 3586.40 MiB on device 1: cudaMalloc failed: out of memory ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 3760611840 graph_reserve: failed to allocate compute buffers ``` granite is running on the old engine which is not as accurate with memory estimation as the new ollama engine. Until the model is migrated, you can mitigate OOM issues using some of the methods show [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288).
Author
Owner

@ghost commented on GitHub (Nov 6, 2025):

Thank you - this info helped.

OLLAMA_FLASH_ATTENTION=1 didn’t fix.
OLLAMA_NUM_PARALLEL=1 didn’t fix.

Setting num_gpu to 0 and creating a granite4:cpu copy worked.

It’s a lot slower than viz. qwen3-coder:30b, qwen3:30b and gemma3:27b which run 100% on gpu but it’s acceptable (response 6.31t/s and prompt 12.14t/s).

Shall I keep this ticket open until granite4 runs properly on gpu?

<!-- gh-comment-id:3498251052 --> @ghost commented on GitHub (Nov 6, 2025): Thank you - this info helped. OLLAMA_FLASH_ATTENTION=1 didn’t fix. OLLAMA_NUM_PARALLEL=1 didn’t fix. Setting num_gpu to 0 and creating a granite4:cpu copy worked. It’s a lot slower than viz. qwen3-coder:30b, qwen3:30b and gemma3:27b which run 100% on gpu but it’s acceptable (response 6.31t/s and prompt 12.14t/s). Shall I keep this ticket open until granite4 runs properly on gpu?
Author
Owner

@rick-github commented on GitHub (Nov 6, 2025):

You can configure the model to run partially on GPU. For example, set OLLAMA_GPU_OVERHEAD, or choose a value for num_gpu that is a few layers less than the full model: 38 or 39.

<!-- gh-comment-id:3498274432 --> @rick-github commented on GitHub (Nov 6, 2025): You can configure the model to run partially on GPU. For example, set `OLLAMA_GPU_OVERHEAD`, or choose a value for `num_gpu` that is a few layers less than the full model: 38 or 39.
Author
Owner

@ghost commented on GitHub (Nov 6, 2025):

Thank you! The magic number was 40 (41 crashes). 40 gives 1%/99% CPU/GPU and runs fast 💨

<!-- gh-comment-id:3498578284 --> @ghost commented on GitHub (Nov 6, 2025): Thank you! The magic number was 40 (41 crashes). 40 gives 1%/99% CPU/GPU and runs fast 💨
Author
Owner

@js-0s commented on GitHub (Nov 10, 2025):

similar issue with a similar model, granite4:tiny-h crash with setting the maximum context size (as done by eg zed-editor by default)

ollama-1  | llama_kv_cache:        CPU KV buffer size =  8192.00 MiB                                                                                           
ollama-1  | llama_kv_cache: size = 8192.00 MiB (1048576 cells,   4 layers,  1/1 seqs), K (f16): 4096.00 MiB, V (f16): 4096.00 MiB                              
ollama-1  | llama_memory_recurrent:        CPU RS buffer size =    55.37 MiB                                                                                   
ollama-1  | llama_memory_recurrent: size =   55.37 MiB (     1 cells,  40 layers,  1 seqs), R (f32):    1.37 MiB, S (f32):   54.00 MiB                         
ollama-1  | //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:325: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed                                                  
ollama-1  | /usr/lib/ollama/libggml-base.so(+0x178c8)[0x7057700308c8]                                                                                          
ollama-1  | /usr/lib/ollama/libggml-base.so(ggml_print_backtrace+0x1e6)[0x705770030c96]                                                                        
ollama-1  | /usr/lib/ollama/libggml-base.so(ggml_abort+0x11d)[0x705770030e1d]                                                                                  
ollama-1  | /usr/lib/ollama/cuda_v12/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_b+0x1023)[0x7056ef101e03]                 
ollama-1  | /usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x11e2c5)[0x7056ef1492c5]                                                                                
ollama-1  | /usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x11ec0d)[0x7056ef149c0d]                                                                                
ollama-1  | /usr/bin/ollama(+0x10dd1e9)[0x57b28592a1e9]                                                                                                        
ollama-1  | /usr/bin/ollama(+0x1158d74)[0x57b2859a5d74]                                                                                                        
ollama-1  | /usr/bin/ollama(+0x115c95f)[0x57b2859a995f]                                                                                                        
ollama-1  | /usr/bin/ollama(+0x115d4cf)[0x57b2859aa4cf]                                                                                                        
ollama-1  | /usr/bin/ollama(+0x1074429)[0x57b2858c1429]                                                                                                        
ollama-1  | /usr/bin/ollama(+0x36bd21)[0x57b284bb8d21]                                                                                                         
ollama-1  | SIGABRT: abort                                                                                                                                     
ollama-1  | PC=0x7057b8611b2c m=5 sigcode=18446744073709551610                                                                                                 
ollama-1  | signal arrived during cgo execution                                                                                                                
ollama-1  |                                                                                                                                                    
ollama-1  | goroutine 44 gp=0xc000103880 m=5 mp=0xc000100008 [syscall]:                                                                                        
ollama-1  | runtime.cgocall(0x57b2858c13c0, 0xc00016fbf8)                                                                                                      
ollama-1  |     runtime/cgocall.go:167 +0x4b fp=0xc00016fbd0 sp=0xc00016fb98 pc=0x57b284badd8b                                                                 
ollama-1  | github.com/ollama/ollama/llama._Cfunc_llama_init_from_model(0x70575c000da0, {0x100000, 0x200, 0x200, 0x1, 0x10, 0x10, 0xffffffff, 0xffffffff, 0xfff
fffff, ...})                                                                                                                                                   
ollama-1  |     _cgo_gotypes.go:748 +0x4e fp=0xc00016fbf8 sp=0xc00016fbd0 pc=0x57b284f664ae                                                                    
ollama-1  | github.com/ollama/ollama/llama.NewContextWithModel.func1(...)                                                                                      
ollama-1  |     github.com/ollama/ollama/llama/llama.go:280                                                                                                    
ollama-1  | github.com/ollama/ollama/llama.NewContextWithModel(0xc000460798, {{0x100000, 0x200, 0x200, 0x1, 0x10, 0x10, 0xffffffff, 0xffffffff, 0xffffffff, ...
}})                                                                                                                                                            
ollama-1  |     github.com/ollama/ollama/llama/llama.go:280 +0x158 fp=0xc00016fd98 sp=0xc00016fbf8 pc=0x57b284f6a278                                           
ollama-1  | github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc000539860, {0x0, 0x0, 0x0, {0xc0004604d0, 0x1, 0x1}, 0xc0003b60c0, 0x0}, {0x7fff
cb9fddc4, ...}, ...)                                                                                                                                           
ollama-1  |     github.com/ollama/ollama/runner/llamarunner/runner.go:797 +0x198 fp=0xc00016fee0 sp=0xc00016fd98 pc=0x57b285027bb8                             
ollama-1  | github.com/ollama/ollama/runner/llamarunner.(*Server).load.gowrap2()                                                                               
ollama-1  |     github.com/ollama/ollama/runner/llamarunner/runner.go:879 +0x175 fp=0xc00016ffe0 sp=0xc00016fee0 pc=0x57b285028c55                             
ollama-1  | runtime.goexit({})                                                                                                                                 
ollama-1  |     runtime/asm_amd64.s:1700 +0x1 fp=0xc00016ffe8 sp=0xc00016ffe0 pc=0x57b284bb90a1                                                                
ollama-1  | created by github.com/ollama/ollama/runner/llamarunner.(*Server).load in goroutine 10                                                              
ollama-1  |     github.com/ollama/ollama/runner/llamarunner/runner.go:879 +0x7ce                                                                               

(ollama 0.12.10)

i suspect that the context-size of 1M causes troubles:

ollama show granite4:tiny-h
  Model
    architecture        granitehybrid    
    parameters          6.9B             
    context length      1048576          
    embedding length    1536             
    quantization        Q4_K_M           

  Capabilities
    completion    
    tools         

manually setting the context to 60k loads the model gracefully into the GPU

ollama ps
NAME               ID              SIZE      PROCESSOR    CONTEXT    UNTIL   
granite4:tiny-h    566b725534ea    6.5 GB    100% GPU     60000      Forever 
<!-- gh-comment-id:3512181997 --> @js-0s commented on GitHub (Nov 10, 2025): similar issue with a similar model, granite4:tiny-h crash with setting the maximum context size (as done by eg zed-editor by default) ``` ollama-1 | llama_kv_cache: CPU KV buffer size = 8192.00 MiB ollama-1 | llama_kv_cache: size = 8192.00 MiB (1048576 cells, 4 layers, 1/1 seqs), K (f16): 4096.00 MiB, V (f16): 4096.00 MiB ollama-1 | llama_memory_recurrent: CPU RS buffer size = 55.37 MiB ollama-1 | llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB ollama-1 | //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:325: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed ollama-1 | /usr/lib/ollama/libggml-base.so(+0x178c8)[0x7057700308c8] ollama-1 | /usr/lib/ollama/libggml-base.so(ggml_print_backtrace+0x1e6)[0x705770030c96] ollama-1 | /usr/lib/ollama/libggml-base.so(ggml_abort+0x11d)[0x705770030e1d] ollama-1 | /usr/lib/ollama/cuda_v12/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_b+0x1023)[0x7056ef101e03] ollama-1 | /usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x11e2c5)[0x7056ef1492c5] ollama-1 | /usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x11ec0d)[0x7056ef149c0d] ollama-1 | /usr/bin/ollama(+0x10dd1e9)[0x57b28592a1e9] ollama-1 | /usr/bin/ollama(+0x1158d74)[0x57b2859a5d74] ollama-1 | /usr/bin/ollama(+0x115c95f)[0x57b2859a995f] ollama-1 | /usr/bin/ollama(+0x115d4cf)[0x57b2859aa4cf] ollama-1 | /usr/bin/ollama(+0x1074429)[0x57b2858c1429] ollama-1 | /usr/bin/ollama(+0x36bd21)[0x57b284bb8d21] ollama-1 | SIGABRT: abort ollama-1 | PC=0x7057b8611b2c m=5 sigcode=18446744073709551610 ollama-1 | signal arrived during cgo execution ollama-1 | ollama-1 | goroutine 44 gp=0xc000103880 m=5 mp=0xc000100008 [syscall]: ollama-1 | runtime.cgocall(0x57b2858c13c0, 0xc00016fbf8) ollama-1 | runtime/cgocall.go:167 +0x4b fp=0xc00016fbd0 sp=0xc00016fb98 pc=0x57b284badd8b ollama-1 | github.com/ollama/ollama/llama._Cfunc_llama_init_from_model(0x70575c000da0, {0x100000, 0x200, 0x200, 0x1, 0x10, 0x10, 0xffffffff, 0xffffffff, 0xfff fffff, ...}) ollama-1 | _cgo_gotypes.go:748 +0x4e fp=0xc00016fbf8 sp=0xc00016fbd0 pc=0x57b284f664ae ollama-1 | github.com/ollama/ollama/llama.NewContextWithModel.func1(...) ollama-1 | github.com/ollama/ollama/llama/llama.go:280 ollama-1 | github.com/ollama/ollama/llama.NewContextWithModel(0xc000460798, {{0x100000, 0x200, 0x200, 0x1, 0x10, 0x10, 0xffffffff, 0xffffffff, 0xffffffff, ... }}) ollama-1 | github.com/ollama/ollama/llama/llama.go:280 +0x158 fp=0xc00016fd98 sp=0xc00016fbf8 pc=0x57b284f6a278 ollama-1 | github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc000539860, {0x0, 0x0, 0x0, {0xc0004604d0, 0x1, 0x1}, 0xc0003b60c0, 0x0}, {0x7fff cb9fddc4, ...}, ...) ollama-1 | github.com/ollama/ollama/runner/llamarunner/runner.go:797 +0x198 fp=0xc00016fee0 sp=0xc00016fd98 pc=0x57b285027bb8 ollama-1 | github.com/ollama/ollama/runner/llamarunner.(*Server).load.gowrap2() ollama-1 | github.com/ollama/ollama/runner/llamarunner/runner.go:879 +0x175 fp=0xc00016ffe0 sp=0xc00016fee0 pc=0x57b285028c55 ollama-1 | runtime.goexit({}) ollama-1 | runtime/asm_amd64.s:1700 +0x1 fp=0xc00016ffe8 sp=0xc00016ffe0 pc=0x57b284bb90a1 ollama-1 | created by github.com/ollama/ollama/runner/llamarunner.(*Server).load in goroutine 10 ollama-1 | github.com/ollama/ollama/runner/llamarunner/runner.go:879 +0x7ce ``` (ollama 0.12.10) i suspect that the context-size of 1M causes troubles: ``` ollama show granite4:tiny-h Model architecture granitehybrid parameters 6.9B context length 1048576 embedding length 1536 quantization Q4_K_M Capabilities completion tools ``` manually setting the context to 60k loads the model gracefully into the GPU ``` ollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL granite4:tiny-h 566b725534ea 6.5 GB 100% GPU 60000 Forever ```
Author
Owner

@rick-github commented on GitHub (Nov 10, 2025):

ollama-1  | //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:325: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed                                                  

Large context is also causing the same error message in https://github.com/ollama/ollama/issues/12998.

<!-- gh-comment-id:3512205871 --> @rick-github commented on GitHub (Nov 10, 2025): ``` ollama-1 | //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:325: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed ``` Large context is also causing the same error message in https://github.com/ollama/ollama/issues/12998.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/ollama#55114