[GH-ISSUE #3711] CUDA malloc fails on newly supported models in 0.1.32 (dual-GPU setup with 72GB VRAM and 128GB RAM) #48796

GiteaMirror · 2026-04-28T09:16:53-05:00

GiteaMirror commented

2026-04-28 09:16:53 -05:00

Originally created by @mz2 on GitHub (Apr 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3711

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I am getting CUDA malloc errors with v0.1.32 (as well as with the current head of the main branch) when trying any of the new big models: wizardlm2, mixtral:8x22b, dbrx (command-r+ does work). My setup is dual-GPU (A6000 + RTX 3090, i.e. 72GB of combined VRAM) with a 24-core 13th-gen Intel CPU and 128GB of DDR5 system RAM.

The symptoms are similar for each of these models, with llama-cpp dying:

ollama run mixtral:8x22b
Error: llama runner process no longer running: 1 error:failed to create context with model '/media/data/ollama/blobs/sha256-b5fc1eb35edf792b07d6163cf7ac162fdd9f9024903e6b33a3a870f2f973b8ca'

From the service logs (this capture is from a dbrx load) I see:

Apr 17 23:09:42 athena ollama.listener[2840147]: time=2024-04-17T23:09:42.621+03:00 level=INFO source=server.go:136 msg="offload to gpu" layers.real=39 layers.estimate=39 memory.available="70186.6 MiB" memory.required.full="72169.5 MiB" memory.required.partial="69599.9 MiB" memory.required.kv="320.0 MiB" memory.weights.total="70752.5 MiB" memory.weights.repeating="69939.4 MiB" memory.weights.nonrepeating="813.1 MiB" memory.graph.full="640.0 MiB" memory.graph.partial="640.0 MiB"
Apr 17 23:09:42 athena ollama.listener[2840147]: time=2024-04-17T23:09:42.621+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 17 23:09:42 athena ollama.listener[2840147]: time=2024-04-17T23:09:42.622+03:00 level=INFO source=server.go:302 msg="starting llama server" cmd="/tmp/ollama3763278260/runners/cuda_v12/ollama_llama_server --model /media/data/ollama/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 39 --port 38939"
Apr 17 23:09:42 athena ollama.listener[2840147]: time=2024-04-17T23:09:42.622+03:00 level=INFO source=server.go:427 msg="waiting for llama runner to start responding"
Apr 17 23:09:42 athena ollama.listener[2841851]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"123385345642496","timestamp":1713384582}
Apr 17 23:09:42 athena ollama.listener[2841851]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"123385345642496","timestamp":1713384582}
Apr 17 23:09:42 athena ollama.listener[2841851]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"123385345642496","timestamp":1713384582,"total_threads":32}
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: loaded meta data with 24 key-value pairs and 323 tensors from /media/data/ollama/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 (version GGUF V3 (latest))
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   0:                       general.architecture str              = dbrx
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   1:                               general.name str              = dbrx
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   2:                           dbrx.block_count u32              = 40
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   3:                        dbrx.context_length u32              = 32768
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   4:                      dbrx.embedding_length u32              = 6144
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   5:                   dbrx.feed_forward_length u32              = 10752
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   6:                  dbrx.attention.head_count u32              = 48
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   7:               dbrx.attention.head_count_kv u32              = 8
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   8:                        dbrx.rope.freq_base f32              = 500000.000000
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   9:                   dbrx.attention.clamp_kqv f32              = 8.000000
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  11:                          dbrx.expert_count u32              = 16
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  12:                     dbrx.expert_used_count u32              = 4
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  13:          dbrx.attention.layer_norm_epsilon f32              = 0.000010
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 100257
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 100257
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 100257
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 100277
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - type  f32:   81 tensors
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - type  f16:   40 tensors
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - type q4_0:  201 tensors
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - type q6_K:    1 tensors
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_vocab: special tokens definition check successful ( 96/100352 ).
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: format           = GGUF V3 (latest)
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: arch             = dbrx
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: vocab type       = BPE
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_vocab          = 100352
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_merges         = 100000
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_ctx_train      = 32768
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd           = 6144
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_head           = 48
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_head_kv        = 8
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_layer          = 40
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_rot            = 128
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd_head_k    = 128
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd_head_v    = 128
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_gqa            = 6
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd_k_gqa     = 1024
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd_v_gqa     = 1024
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_clamp_kqv      = 8.0e+00
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_ff             = 10752
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_expert         = 16
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_expert_used    = 4
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: causal attn      = 1
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: pooling type     = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: rope type        = 2
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: rope scaling     = linear
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: freq_base_train  = 500000.0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: freq_scale_train = 1
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: rope_finetuned   = unknown
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: ssm_d_conv       = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: ssm_d_inner      = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: ssm_d_state      = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: model type       = 16x12B
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: model ftype      = Q4_0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: model params     = 131.60 B
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: model size       = 69.09 GiB (4.51 BPW)
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: general.name     = dbrx
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: BOS token        = 100257 '<|endoftext|>'
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: EOS token        = 100257 '<|endoftext|>'
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: UNK token        = 100257 '<|endoftext|>'
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: PAD token        = 100277 '<|pad|>'
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: LF token         = 128 'Ä'
Apr 17 23:09:42 athena ollama.listener[2840147]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 17 23:09:42 athena ollama.listener[2840147]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 17 23:09:42 athena ollama.listener[2840147]: ggml_cuda_init: found 2 CUDA devices:
Apr 17 23:09:42 athena ollama.listener[2840147]:   Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Apr 17 23:09:42 athena ollama.listener[2840147]:   Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_tensors: ggml ctx size =    1.10 MiB
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors: offloading 39 repeating layers to GPU
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors: offloaded 39/41 layers to GPU
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors:        CPU buffer size = 70752.49 MiB
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors:      CUDA0 buffer size = 45460.59 MiB
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors:      CUDA1 buffer size = 22730.30 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: ....................................................................................................
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: n_ctx      = 2048
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: n_batch    = 512
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: n_ubatch   = 512
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: freq_base  = 500000.0
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: freq_scale = 1
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_kv_cache_init:  CUDA_Host KV buffer size =     8.00 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_kv_cache_init:      CUDA0 KV buffer size =   208.00 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_kv_cache_init:      CUDA1 KV buffer size =   104.00 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.41 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1794.00 MiB on device 0: cudaMalloc failed: out of memory
Apr 17 23:09:53 athena ollama.listener[2840147]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1881147392
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: failed to allocate compute buffers
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_init_from_gpt_params: error: failed to create context with model '/media/data/ollama/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8'
Apr 17 23:09:54 athena ollama.listener[2841851]: {"function":"load_model","level":"ERR","line":410,"model":"/media/data/ollama/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8","msg":"unable to load model","tid":"123385345642496","timestamp":1713384594}

When watching nvidia-smi (watch nvidia-smi), I see that GPU 0 (the A6000) has its memory almost fully allocated just before the malloc failure occurs.
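
As a rough cross-check of the figures in the log above, the allocations on device 0 can be added up and compared with the estimate from the "offload to gpu" line. This is only a sketch: the 48 GiB capacity used for the A6000 is its nominal size, assumed here rather than taken from the log, and the memory actually usable by CUDA is somewhat lower.

```python
# Figures copied from the service log above (all in MiB).
weights_cuda0 = 45460.59     # llm_load_tensors: CUDA0 buffer size
kv_cuda0 = 208.00            # llama_kv_cache_init: CUDA0 KV buffer size
compute_cuda0 = 1794.00      # compute buffer whose cudaMalloc failed on device 0
graph_estimate = 640.00      # memory.graph.partial from the "offload to gpu" line
mem_available = 70186.6      # memory.available (both GPUs combined)
mem_required_partial = 69599.9  # memory.required.partial

# ASSUMPTION: nominal 48 GiB for the RTX A6000; not reported in the log.
a6000_nominal = 48 * 1024

# Total that device 0 would have to hold if the compute buffer had succeeded.
needed_on_cuda0 = weights_cuda0 + kv_cuda0 + compute_cuda0
# Nominal headroom left on device 0, before any driver or display overhead.
headroom = a6000_nominal - needed_on_cuda0
# How much larger the requested compute buffer is than the estimated graph size.
graph_shortfall = compute_cuda0 - graph_estimate
# Margin the scheduler believed it had across both GPUs.
estimate_margin = mem_available - mem_required_partial

print(f"needed on CUDA0:  {needed_on_cuda0:.2f} MiB")
print(f"nominal headroom: {headroom:.2f} MiB")
print(f"graph shortfall:  {graph_shortfall:.2f} MiB")
print(f"estimate margin:  {estimate_margin:.2f} MiB")
```

Under these assumptions, device 0 would need about 47462.59 MiB, and the failed compute buffer is 1154 MiB larger than the 640 MiB graph estimate, which is more than the 586.7 MiB margin between memory.available and memory.required.partial. That is consistent with, but does not prove, the offload estimate being too optimistic for GPU 0.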

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32 (or current head of main branch)

GiteaMirror added the bug label 2026-04-28 09:16:53 -05:00

GiteaMirror closed this issue

2026-04-28 09:16:56 -05:00

GiteaMirror commented

2026-04-28 09:16:57 -05:00

@elabeca commented on GitHub (Apr 18, 2024):

I have the same issue with NVIDIA driver version 535.171.04, CUDA Version 12.2, and 2 x RTX 4090.

After upgrading to driver 545.29.06 (CUDA Version 12.3) it worked, but after a reboot it failed again with the following message:

Error: llama runner process no longer running: 1 error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-cfcf93119280c4a10c1df57335bad341e000cabbc4faff125531d941a5b0befa'

GiteaMirror commented

2026-04-28 09:16:58 -05:00

@nkeilar commented on GitHub (Apr 18, 2024):

I'm getting similar messages with dual RTX 3090 cards.

GiteaMirror commented

2026-04-28 09:16:58 -05:00

@one-bit commented on GitHub (Apr 18, 2024):

I'm getting the same error. I'm running ollama 0.1.32 on Linux. I have an RTX 4090 and a GTX 1080 Ti, 128 GB of DDR4 RAM, and an Intel Core i9-7980XE with 18 cores/36 threads.

ollama run wizardlm2:8x22b-q2_K

Error: llama runner process no longer running: 1 error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-502d8aaa76270a2f3faa4bf4c7aa6fc7f890cd14faf4885e2a78889cc7953195'

ollama --version

ollama version is 0.1.32

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:17:00.0  On |                    0 |
|  0%   54C    P5             35W /  450W |    4190MiB /  23028MiB |     42%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:65:00.0 Off |                  N/A |
|  0%   46C    P8             13W /  275W |     146MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

screenfetch

 ██████████████████  ████████     
 ██████████████████  ████████     OS: Manjaro 23.1.4 Vulcan
 ██████████████████  ████████     Kernel: x86_64 Linux 6.8.5-1-MANJARO
 ██████████████████  ████████     Uptime: 3d 7h 46m
 ████████            ████████     Packages: 2418
 ████████  ████████  ████████     Shell: fish 3.7.0
 ████████  ████████  ████████     Resolution: 7680x1440
 ████████  ████████  ████████     DE: KDE 5.115.0 / Plasma 5.27.11
 ████████  ████████  ████████     WM: KWin
 ████████  ████████  ████████     GTK Theme: Breeze [GTK2/3]

 ████████  ████████  ████████     Icon Theme: breeze
 ████████  ████████  ████████     Disk: 4,7T / 20T (24%)
 ████████  ████████  ████████     CPU: Intel Core i9-7980XE @ 36x 4.2GHz [37.0°C]
 ████████  ████████  ████████     GPU: NVIDIA GeForce RTX 4090, NVIDIA GeForce GTX 1080 Ti
                                  RAM: 33517MiB / 128494MiB

GiteaMirror commented

2026-04-28 09:17:01 -05:00

@s00x3r commented on GitHub (Apr 18, 2024):

same Error: llama runner process no longer running: 1 error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-cfcf93119280c4a10c1df57335bad341e000cabbc4faff125531d941a5b0befa'

GiteaMirror commented

2026-04-28 09:17:03 -05:00

@MarkWard0110 commented on GitHub (Apr 18, 2024):

I'm getting the same error with the new models.

Intel Core i9 14900k
DDR5 6400 2x48GB (96GB)
Nvidia RTX 4070 TI Super 16GB

Apr 18 18:57:54 quorra ollama[1170]: time=2024-04-18T18:57:54.713Z level=INFO source=routes.go:97 msg="changing loaded model"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.039Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.039Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.128Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.141Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.141Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.174Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.186Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=8 layers=8 required="71518.7 MiB" used="14828.9 MiB" available="15857.2 MiB" kv="320.0 MiB" fulloffload="320.0 MiB" partialoffload="320.0 MiB"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.186Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.187Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1615772994/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 8 --port 41305"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.187Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 18 18:57:55 quorra ollama[20191]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140044021682176","timestamp":1713466675}
Apr 18 18:57:55 quorra ollama[20191]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140044021682176","timestamp":1713466675}
Apr 18 18:57:55 quorra ollama[20191]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140044021682176","timestamp":1713466675,"total_threads":32}
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: loaded meta data with 24 key-value pairs and 323 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 (version GGUF V3 (latest))
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   0:                       general.architecture str              = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   1:                               general.name str              = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   2:                           dbrx.block_count u32              = 40
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   3:                        dbrx.context_length u32              = 32768
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   4:                      dbrx.embedding_length u32              = 6144
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   5:                   dbrx.feed_forward_length u32              = 10752
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   6:                  dbrx.attention.head_count u32              = 48
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   7:               dbrx.attention.head_count_kv u32              = 8
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   8:                        dbrx.rope.freq_base f32              = 500000.000000
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   9:                   dbrx.attention.clamp_kqv f32              = 8.000000
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  11:                          dbrx.expert_count u32              = 16
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  12:                     dbrx.expert_used_count u32              = 4
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  13:          dbrx.attention.layer_norm_epsilon f32              = 0.000010
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 100257
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 100257
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 100257
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 100277
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type  f32:   81 tensors
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type  f16:   40 tensors
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type q4_0:  201 tensors
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type q6_K:    1 tensors
Apr 18 18:57:55 quorra ollama[1170]: llm_load_vocab: special tokens definition check successful ( 96/100352 ).
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: format           = GGUF V3 (latest)
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: arch             = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: vocab type       = BPE
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_vocab          = 100352
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_merges         = 100000
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_ctx_train      = 32768
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd           = 6144
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_head           = 48
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_head_kv        = 8
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_layer          = 40
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_rot            = 128
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_head_k    = 128
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_head_v    = 128
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_gqa            = 6
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_k_gqa     = 1024
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_v_gqa     = 1024
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_clamp_kqv      = 8.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_ff             = 10752
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_expert         = 16
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_expert_used    = 4
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: causal attn      = 1
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: pooling type     = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope type        = 2
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope scaling     = linear
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: freq_base_train  = 500000.0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: freq_scale_train = 1
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope_finetuned   = unknown
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_conv       = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_inner      = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_state      = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model type       = 16x12B
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model ftype      = Q4_0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model params     = 131.60 B
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model size       = 69.09 GiB (4.51 BPW)
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: general.name     = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: BOS token        = 100257 '<|endoftext|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: EOS token        = 100257 '<|endoftext|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: UNK token        = 100257 '<|endoftext|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: PAD token        = 100277 '<|pad|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: LF token         = 128 'Ä'
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: found 1 CUDA devices:
Apr 18 18:57:55 quorra ollama[1170]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 18 18:57:55 quorra ollama[1170]: llm_load_tensors: ggml ctx size =    0.74 MiB
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: offloading 8 repeating layers to GPU
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: offloaded 8/41 layers to GPU
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors:        CPU buffer size = 70752.49 MiB
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors:      CUDA0 buffer size = 13987.88 MiB
Apr 18 18:58:19 quorra ollama[1170]: ....................................................................................................
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_ctx      = 2048
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_batch    = 512
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_ubatch   = 512
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: freq_base  = 500000.0
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: freq_scale = 1
Apr 18 18:58:19 quorra ollama[1170]: llama_kv_cache_init:  CUDA_Host KV buffer size =   256.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_kv_cache_init:      CUDA0 KV buffer size =    64.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.41 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model:      CUDA0 compute buffer size =  1794.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: graph nodes  = 2886
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: graph splits = 325
Apr 18 18:58:20 quorra ollama[1170]: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
Apr 18 18:58:20 quorra ollama[1170]:   current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526
Apr 18 18:58:20 quorra ollama[1170]:   cublasCreate_v2(&cublas_handles[device])
Apr 18 18:58:20 quorra ollama[1170]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Apr 18 18:58:20 quorra ollama[1170]: time=2024-04-18T18:58:20.740Z level=ERROR source=routes.go:120 msg="error loading llama server" error="llama runner process no longer running: -1 CUDA error: CUBLAS_STATUS_NOT_INITIALIZED\n  current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526\n  cublasCreate_v2(&cublas_handles[device])\nGGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !\"CUDA error\""
Apr 18 18:58:20 quorra ollama[1170]: [GIN] 2024/04/18 - 18:58:20 | 500 | 26.029470037s |      10.0.0.123 | POST     "/api/generate"
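
Reading the numbers off the log above, the GPU looks almost completely full by the time `cublasCreate_v2` is called. The sketch below only adds up figures copied from that log; treating the small remainder as the reason for `CUBLAS_STATUS_NOT_INITIALIZED` is an assumption, not something the log states.

```python
# All values are in MiB and copied from the log above.
# The interpretation (too little headroom for cuBLAS) is an assumption.
available = 15857.2        # "offload to gpu" available=
estimated_used = 14828.9   # "offload to gpu" used=
weights = 13987.88         # llm_load_tensors: CUDA0 buffer size
kv_cache = 64.00           # llama_kv_cache_init: CUDA0 KV buffer size
compute = 1794.00          # llama_new_context_with_model: CUDA0 compute buffer size

# What the runner actually allocated on the GPU before the failure.
allocated = weights + kv_cache + compute
# What was still free when the cuBLAS handle was created.
headroom = available - allocated
# How far the up-front estimate fell short of the real allocations.
underestimate = allocated - estimated_used

print(f"allocated on GPU:      {allocated:.2f} MiB")
print(f"headroom remaining:    {headroom:.2f} MiB")
print(f"estimate was short by: {underestimate:.2f} MiB")
```

If this reading is right, offloading fewer layers than the estimate suggests would leave more room, but that is untested here.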

GiteaMirror commented

2026-04-28 09:17:04 -05:00

@KolioM commented on GitHub (Apr 19, 2024):

I have the exact same issue on both Windows and Linux using two 3090s, latest driver...


GiteaMirror commented

2026-04-28 09:17:05 -05:00

@s00x3r commented on GitHub (Apr 19, 2024):

I'm not running in a virtual machine, and the CPU does have AVX and AVX2. From /proc/cpuinfo:
flags : avx avx2


GiteaMirror commented

2026-04-28 09:17:06 -05:00

@KolioM commented on GitHub (Apr 19, 2024):

Btw, Llama 3 70B runs smoothly; only the Mixtral 8x22b models fail.


GiteaMirror commented

2026-04-28 09:17:06 -05:00

@s00x3r commented on GitHub (Apr 21, 2024):

FOUND THE PROBLEM!
When ollama starts the 8x22b model with the first command below, it works. Pay attention: the only difference is the --n-gpu-layers parameter.

Working (it also works if I start with --n-gpu-layers 25, etc.):
ollama[547518]: time=2024-04-21T10:20:54.906+03:00 level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1719319998/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-cfcf93119280c4a10c1df57335bad341e000cabbc4faff125531d941a5b0befa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 23 --port 4123"

Not working:
ollama[547518]: time=2024-04-21T10:11:26.056+03:00 level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1719319998/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-cfcf93119280c4a10c1df57335bad341e000cabbc4faff125531d941a5b0befa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --port 30935"
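For anyone comparing their own server logs the same way, here is a minimal sketch that pulls the `--n-gpu-layers` value out of a "starting llama server" line. The helper name `gpu_layers_from_log` is made up for illustration, and the two sample strings are shortened copies of the commands quoted above (model path elided), so treat it as a rough aid rather than anything ollama ships.

```python
import re
from typing import Optional

# Matches the "--n-gpu-layers N" flag inside a logged server command line.
_LAYERS_RE = re.compile(r"--n-gpu-layers\s+(\d+)")


def gpu_layers_from_log(line: str) -> Optional[int]:
    """Return the --n-gpu-layers value found in a log line, or None if absent."""
    match = _LAYERS_RE.search(line)
    return int(match.group(1)) if match else None


# Shortened copies of the two commands quoted above (hypothetical abbreviations).
working = 'cmd="ollama_llama_server --ctx-size 2048 --n-gpu-layers 23 --port 4123"'
failing = 'cmd="ollama_llama_server --ctx-size 2048 --n-gpu-layers 33 --port 30935"'

print(gpu_layers_from_log(working), gpu_layers_from_log(failing))
```

Running the extraction over a whole journal dump makes it easier to spot which layer counts coincide with a crash.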

GiteaMirror commented

2026-04-28 09:17:08 -05:00

@mz2 commented on GitHub (Apr 21, 2024):

Nice, same story here: in my case, the command that induces the CUDA malloc failure with mixtral:8x22b is, as extracted from the log:

```bash
/tmp/ollama346288213/runners/cuda_v12/ollama_llama_server --model /media/data/ollama/blobs/sha256-7ec0c94a95cafef2780d00679e83f172ac343bc828aebbe2a5475fbe2daf76ff --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 51 --port 43359
```

The same command runs happily if I remove just one layer from the GPU offload: `--n-gpu-layers 50` works. Subtracting 1 from ollama's estimate of the number of layers appears to be enough to also get dbrx and wizardlm2 running on my dual-GPU setup with `cuda_v12`. I also confirmed this with a local build of ollama where I did this:

```golang
	// Workaround: offload one layer fewer than ollama's own estimate.
	if opts.NumGPU >= 0 {
		params = append(params, "--n-gpu-layers", fmt.Sprintf("%d", opts.NumGPU - 1)) // lol
	}
```

(Also, unsurprisingly, the CPU-based library is not affected, i.e. `OLLAMA_LLM_LIBRARY="cpu_avx2" ollama run mixtral:8x22b` also works around the issue.)
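The same "one layer fewer" idea can probably be tried without a local build by overriding the layer count per request. The sketch below only builds the JSON body. It assumes the Ollama REST API accepts a `num_gpu` entry under `options` on `/api/generate`, and the helper name `generate_payload` and the `margin` argument are made up for illustration.

```python
import json


def generate_payload(model: str, prompt: str, estimated_layers: int, margin: int = 1) -> dict:
    """Build a request body that offloads `margin` fewer layers than estimated."""
    # Clamp at zero so a small estimate never turns into a negative layer count.
    num_gpu = max(estimated_layers - margin, 0)
    return {"model": model, "prompt": prompt, "options": {"num_gpu": num_gpu}}


# 51 is the layer count from the failing command above; 50 was reported to work.
body = generate_payload("mixtral:8x22b", "Hello", estimated_layers=51)
print(json.dumps(body))
```

Sending the body (for example with `curl` or `urllib.request`) is left out on purpose, since the host and port depend on the local setup.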

GiteaMirror commented

2026-04-28 09:17:10 -05:00

@one-bit commented on GitHub (Apr 21, 2024):

In my case, I was able to work around the issue by creating a custom `Modelfile` with a custom value for the `num_gpu` parameter, limiting the number of offloaded layers to what fits into my GPU's VRAM. I started with a value of 10 and increased it until it crashed again, then I kept the last value that worked (in my case 18).

Here's my custom `Modelfile`:

```
FROM wizardlm2:8x22b

PARAMETER num_gpu 18
```

Then I created a new model entry in `ollama` using this `Modelfile`:

```
ollama create wizardlm-fix -f Modelfile
```

And finally I was able to load the model using:

```
ollama run wizardlm-fix
```

Please note that this does not fix the issue _per se_, but it may be a temporary workaround if you really want to run/test these models.
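The trial-and-error procedure described above (start at 10, raise `num_gpu` until the load fails, keep the last value that worked) can be sketched roughly as follows. Everything here is illustrative: `find_max_gpu_layers`, `render_modelfile`, and the `loads_ok` callback are hypothetical names, and the fake loader merely pretends that anything above 18 layers crashes, mirroring the number reported in the comment.

```python
from typing import Callable, Optional


def find_max_gpu_layers(loads_ok: Callable[[int], bool], start: int = 10, limit: int = 200) -> Optional[int]:
    """Increase num_gpu from `start` until loading fails; return the last good value."""
    last_good = None
    for num_gpu in range(start, limit + 1):
        if not loads_ok(num_gpu):
            break
        last_good = num_gpu
    return last_good


def render_modelfile(base_model: str, num_gpu: int) -> str:
    """Produce Modelfile text pinning the number of GPU-offloaded layers."""
    return f"FROM {base_model}\n\nPARAMETER num_gpu {num_gpu}\n"


def fake_loader(num_gpu: int) -> bool:
    # Stand-in for a real load attempt: pretend anything above 18 layers crashes.
    return num_gpu <= 18


best = find_max_gpu_layers(fake_loader)
print(render_modelfile("wizardlm2:8x22b", best))
```

In practice `loads_ok` would have to create and run the model for each candidate value, which is slow for models of this size, so a larger step (or a bisection between a known-good and a known-bad value) may be more practical.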

GiteaMirror commented

2026-04-28 09:17:11 -05:00

@nkeilar commented on GitHub (Apr 24, 2024):

This is what worked for me with a dual 3090 setup and a 14900K. Note that I limited the context to something quite short as a proof of concept. Response speed: 8.6 tok/sec.

```
FROM wizardlm2:8x22b-q2_K
TEMPLATE """{{ if .System }}{{ .System }} {{ end }}{{ if .Prompt }}USER: {{ .Prompt }} {{ end }}ASSISTANT: {{ .Response }}"""
SYSTEM """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."""
PARAMETER stop "USER:"
PARAMETER stop "ASSISTANT:"
PARAMETER num_ctx 2000
PARAMETER num_gpu 40
```

GiteaMirror commented

2026-04-28 09:17:13 -05:00

@plp38 commented on GitHub (May 23, 2024):

> In my case, I was able to work around the issue by creating a custom `Modelfile` with a custom value for the `num_gpu` parameter, to limit the number of layers that can fit into my GPUs VRAM. I started with a value of 10 and increased it until it crashed again, then I kept the last value that worked (in my case 18).
>
> Here's my custom `Modelfile`:
>
> ```
> FROM wizardlm2:8x22b
>
> PARAMETER num_gpu 18
> ```
>
> Then I created a new model entry in `ollama` using this `Modelfile`:
>
> ```
> ollama create wizardlm-fix -f Modelfile
> ```
>
> And finally I was able to load the model using:
>
> ```
> ollama run wizardlm-fix
> ```
>
> Please note that this does not fix the issue _per se_, but it may be a temporary workaround if you really want to run/test these models.

Thank you, the `num_gpu` parameter solved my issue with my GPU. However, what does it mean to limit the number of layers, and what is the impact? Sorry for the stupid question, but are the other layers loaded in RAM?

GiteaMirror commented

2026-04-28 09:17:14 -05:00

@one-bit commented on GitHub (May 28, 2024):

> > In my case, I was able to work around the issue by creating a custom `Modelfile` with a custom value for the `num_gpu` parameter, to limit the number of layers that can fit into my GPUs VRAM. I started with a value of 10 and increased it until it crashed again, then I kept the last value that worked (in my case 18).
> >
> > Here's my custom `Modelfile`:
> >
> > ```
> > FROM wizardlm2:8x22b
> >
> > PARAMETER num_gpu 18
> > ```
> >
> > Then I created a new model entry in `ollama` using this `Modelfile`:
> >
> > ```
> > ollama create wizardlm-fix -f Modelfile
> > ```
> >
> > And finally I was able to load the model using:
> >
> > ```
> > ollama run wizardlm-fix
> > ```
> >
> > Please note that this does not fix the issue _per se_, but it may be a temporary workaround if you really want to run/test these models.
>
> Thank you, the 'num_gpu' paramater solved my issue with my GPU. However what does it mean to limit the number of layers, what is the impact ? Sorry for the stupid question, but are the others layers loaded in RAM ?

This parameter limits the number of model layers that can be loaded into GPU VRAM (and processed by the GPU); the remaining layers of the model are loaded into the computer's RAM (and processed by the CPU).
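As a rough illustration of that split, the arithmetic can be sketched as below. This is not ollama's scheduling code: `split_layers` is a made-up helper, and the numbers come from the "offloaded 8/41 layers to GPU" line in the dbrx log quoted earlier in this thread.

```python
from typing import Tuple


def split_layers(total_layers: int, num_gpu: int) -> Tuple[int, int]:
    """Return (layers on GPU, layers left on CPU) for a requested num_gpu."""
    # A request larger than the model, or a negative one, is clamped to a valid range.
    on_gpu = min(max(num_gpu, 0), total_layers)
    return on_gpu, total_layers - on_gpu


# 41 layers in total with 8 offloaded, as in the dbrx log above.
print(split_layers(41, 8))
```

The fewer layers sit on the GPU, the more of each token's computation runs on the CPU, so a lower `num_gpu` trades speed for a smaller VRAM footprint.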

GiteaMirror commented

2026-04-28 09:17:15 -05:00

@dhiltgen commented on GitHub (Jun 1, 2024):

Can you see if the latest release has improved the situation on the larger mixtral models?


GiteaMirror commented

2026-04-28 09:17:16 -05:00

@dhiltgen commented on GitHub (Jun 22, 2024):

The latest release (0.1.45) has improvements around multi-GPU prediction and layer splits, which should help in these situations. If you're still seeing OOMs after upgrading, please share an updated server log and I'll reopen the issue.


Reference: github-starred/ollama#48796