[GH-ISSUE #3711] CUDA malloc fails on newly supported models in 0.1.32 (dual-GPU setup with 72GB VRAM and 128GB RAM) #64322

Closed
opened 2026-05-03 17:04:32 -05:00 by GiteaMirror · 16 comments
Owner

Originally created by @mz2 on GitHub (Apr 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3711

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I am getting CUDA malloc errors with v0.1.32 (as well as with the current head of the main branch) when trying any of the new big models: wizardlm2, mixtral:8x22b, and dbrx (command-r+ does work). My setup is dual-GPU (A6000 + RTX 3090, i.e. 72GB combined VRAM) with a 24-core 13th-gen Intel CPU and 128GB of DDR5.

The symptoms are similar for each model, with llama.cpp dying:

ollama run mixtral:8x22b
Error: llama runner process no longer running: 1 error:failed to create context with model '/media/data/ollama/blobs/sha256-b5fc1eb35edf792b07d6163cf7ac162fdd9f9024903e6b33a3a870f2f973b8ca'
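(Aside: a sketch of how these runner logs can be pulled from systemd, assuming a stock ollama.service unit; on this host the syslog identifier shows up as ollama.listener, so the unit name may differ per install.)

```sh
# Follow the ollama service logs live; adjust the unit name to your install.
journalctl -u ollama -f
# Or dump the last few hundred lines after a failed run:
journalctl -u ollama --no-pager -n 300
```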

From the service logs I see:

Apr 17 23:09:42 athena ollama.listener[2840147]: time=2024-04-17T23:09:42.621+03:00 level=INFO source=server.go:136 msg="offload to gpu" layers.real=39 layers.estimate=39 memory.available="70186.6 MiB" memory.required.full="72169.5 MiB" memory.required.partial="69599.9 MiB" memory.required.kv="320.0 MiB" memory.weights.total="70752.5 MiB" memory.weights.repeating="69939.4 MiB" memory.weights.nonrepeating="813.1 MiB" memory.graph.full="640.0 MiB" memory.graph.partial="640.0 MiB"
Apr 17 23:09:42 athena ollama.listener[2840147]: time=2024-04-17T23:09:42.621+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 17 23:09:42 athena ollama.listener[2840147]: time=2024-04-17T23:09:42.622+03:00 level=INFO source=server.go:302 msg="starting llama server" cmd="/tmp/ollama3763278260/runners/cuda_v12/ollama_llama_server --model /media/data/ollama/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 39 --port 38939"
Apr 17 23:09:42 athena ollama.listener[2840147]: time=2024-04-17T23:09:42.622+03:00 level=INFO source=server.go:427 msg="waiting for llama runner to start responding"
Apr 17 23:09:42 athena ollama.listener[2841851]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"123385345642496","timestamp":1713384582}
Apr 17 23:09:42 athena ollama.listener[2841851]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"123385345642496","timestamp":1713384582}
Apr 17 23:09:42 athena ollama.listener[2841851]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"123385345642496","timestamp":1713384582,"total_threads":32}
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: loaded meta data with 24 key-value pairs and 323 tensors from /media/data/ollama/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 (version GGUF V3 (latest))
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   0:                       general.architecture str              = dbrx
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   1:                               general.name str              = dbrx
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   2:                           dbrx.block_count u32              = 40
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   3:                        dbrx.context_length u32              = 32768
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   4:                      dbrx.embedding_length u32              = 6144
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   5:                   dbrx.feed_forward_length u32              = 10752
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   6:                  dbrx.attention.head_count u32              = 48
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   7:               dbrx.attention.head_count_kv u32              = 8
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   8:                        dbrx.rope.freq_base f32              = 500000.000000
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv   9:                   dbrx.attention.clamp_kqv f32              = 8.000000
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  11:                          dbrx.expert_count u32              = 16
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  12:                     dbrx.expert_used_count u32              = 4
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  13:          dbrx.attention.layer_norm_epsilon f32              = 0.000010
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 100257
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 100257
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 100257
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 100277
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - type  f32:   81 tensors
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - type  f16:   40 tensors
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - type q4_0:  201 tensors
Apr 17 23:09:42 athena ollama.listener[2840147]: llama_model_loader: - type q6_K:    1 tensors
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_vocab: special tokens definition check successful ( 96/100352 ).
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: format           = GGUF V3 (latest)
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: arch             = dbrx
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: vocab type       = BPE
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_vocab          = 100352
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_merges         = 100000
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_ctx_train      = 32768
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd           = 6144
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_head           = 48
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_head_kv        = 8
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_layer          = 40
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_rot            = 128
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd_head_k    = 128
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd_head_v    = 128
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_gqa            = 6
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd_k_gqa     = 1024
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_embd_v_gqa     = 1024
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_clamp_kqv      = 8.0e+00
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_ff             = 10752
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_expert         = 16
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_expert_used    = 4
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: causal attn      = 1
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: pooling type     = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: rope type        = 2
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: rope scaling     = linear
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: freq_base_train  = 500000.0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: freq_scale_train = 1
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: rope_finetuned   = unknown
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: ssm_d_conv       = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: ssm_d_inner      = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: ssm_d_state      = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: model type       = 16x12B
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: model ftype      = Q4_0
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: model params     = 131.60 B
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: model size       = 69.09 GiB (4.51 BPW)
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: general.name     = dbrx
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: BOS token        = 100257 '<|endoftext|>'
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: EOS token        = 100257 '<|endoftext|>'
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: UNK token        = 100257 '<|endoftext|>'
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: PAD token        = 100277 '<|pad|>'
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_print_meta: LF token         = 128 'Ä'
Apr 17 23:09:42 athena ollama.listener[2840147]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 17 23:09:42 athena ollama.listener[2840147]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 17 23:09:42 athena ollama.listener[2840147]: ggml_cuda_init: found 2 CUDA devices:
Apr 17 23:09:42 athena ollama.listener[2840147]:   Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Apr 17 23:09:42 athena ollama.listener[2840147]:   Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Apr 17 23:09:42 athena ollama.listener[2840147]: llm_load_tensors: ggml ctx size =    1.10 MiB
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors: offloading 39 repeating layers to GPU
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors: offloaded 39/41 layers to GPU
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors:        CPU buffer size = 70752.49 MiB
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors:      CUDA0 buffer size = 45460.59 MiB
Apr 17 23:09:46 athena ollama.listener[2840147]: llm_load_tensors:      CUDA1 buffer size = 22730.30 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: ....................................................................................................
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: n_ctx      = 2048
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: n_batch    = 512
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: n_ubatch   = 512
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: freq_base  = 500000.0
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: freq_scale = 1
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_kv_cache_init:  CUDA_Host KV buffer size =     8.00 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_kv_cache_init:      CUDA0 KV buffer size =   208.00 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_kv_cache_init:      CUDA1 KV buffer size =   104.00 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.41 MiB
Apr 17 23:09:53 athena ollama.listener[2840147]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1794.00 MiB on device 0: cudaMalloc failed: out of memory
Apr 17 23:09:53 athena ollama.listener[2840147]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1881147392
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_new_context_with_model: failed to allocate compute buffers
Apr 17 23:09:53 athena ollama.listener[2840147]: llama_init_from_gpt_params: error: failed to create context with model '/media/data/ollama/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8'
Apr 17 23:09:54 athena ollama.listener[2841851]: {"function":"load_model","level":"ERR","line":410,"model":"/media/data/ollama/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8","msg":"unable to load model","tid":"123385345642496","timestamp":1713384594}

When watching nvidia-smi (`watch nvidia-smi`) I see that GPU 0 (the A6000) gets its memory nearly fully allocated before the malloc failure occurs. That matches the log above: CUDA0 already holds 45460.59 MiB of weights plus 208.00 MiB of KV cache, so the 1794.00 MiB compute buffer leaves very little headroom on the 48GB card once display and CUDA context overhead are counted. Note also that the scheduler estimated memory.graph.partial at only 640.0 MiB, while the actual CUDA0 compute buffer came out to 1794.00 MiB, so the offload estimate appears to undercount the compute buffer.
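A possible mitigation sketch (not a confirmed fix): manually cap the number of offloaded layers below the scheduler's estimate of 39, so the compute buffer still fits on device 0. num_gpu is the standard Modelfile parameter for this; the value 32 below is a guess for this setup and would need tuning against `watch nvidia-smi`.

```sh
# Hedged workaround sketch: offload fewer layers than estimated so the
# ~1.8 GiB compute buffer still fits on GPU 0. num_gpu 32 is a guessed
# value for this 72GB dual-GPU setup, not a recommendation.
cat > Modelfile <<'EOF'
FROM mixtral:8x22b
PARAMETER num_gpu 32
EOF
ollama create mixtral-limited -f Modelfile
ollama run mixtral-limited
```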

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32 (or current head of main branch)

GiteaMirror added the bug label 2026-05-03 17:04:33 -05:00
Author
Owner

@elabeca commented on GitHub (Apr 18, 2024):

I have the same issue with NVIDIA driver version 535.171.04, CUDA Version 12.2, and 2 x RTX 4090.

Upgrading to 545.29.06 (CUDA Version 12.3) made it work, but after a reboot it failed again with the following message:

Error: llama runner process no longer running: 1 error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-cfcf93119280c4a10c1df57335bad341e000cabbc4faff125531d941a5b0befa'
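Since the failure seems sensitive to the driver/CUDA pairing, it may be worth confirming what the kernel module actually reports after each reboot (standard nvidia-smi usage; the CUDA version shown is the driver's supported runtime, not a toolkit install):

```sh
# Report the loaded driver version per GPU.
nvidia-smi --query-gpu=index,name,driver_version --format=csv
# The first lines of the nvidia-smi banner include the CUDA version.
nvidia-smi | head -n 4
```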
Author
Owner

@nkeilar commented on GitHub (Apr 18, 2024):

I'm getting similar messages with dual 3090 cards.

Author
Owner

@one-bit commented on GitHub (Apr 18, 2024):

I'm getting the same error. I'm running ollama 0.1.32 on Linux with an RTX 4090, a GTX 1080 Ti, 128 GB of DDR4 RAM, and an Intel Core i9-7980XE (18 cores/36 threads).

ollama run wizardlm2:8x22b-q2_K

Error: llama runner process no longer running: 1 error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-502d8aaa76270a2f3faa4bf4c7aa6fc7f890cd14faf4885e2a78889cc7953195'

ollama --version

ollama version is 0.1.32

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:17:00.0  On |                    0 |
|  0%   54C    P5             35W /  450W |    4190MiB /  23028MiB |     42%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:65:00.0 Off |                  N/A |
|  0%   46C    P8             13W /  275W |     146MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

screenfetch

 ██████████████████  ████████     
 ██████████████████  ████████     OS: Manjaro 23.1.4 Vulcan
 ██████████████████  ████████     Kernel: x86_64 Linux 6.8.5-1-MANJARO
 ██████████████████  ████████     Uptime: 3d 7h 46m
 ████████            ████████     Packages: 2418
 ████████  ████████  ████████     Shell: fish 3.7.0
 ████████  ████████  ████████     Resolution: 7680x1440
 ████████  ████████  ████████     DE: KDE 5.115.0 / Plasma 5.27.11
 ████████  ████████  ████████     WM: KWin
 ████████  ████████  ████████     GTK Theme: Breeze [GTK2/3]
 ████████  ████████  ████████     Icon Theme: breeze
 ████████  ████████  ████████     Disk: 4,7T / 20T (24%)
 ████████  ████████  ████████     CPU: Intel Core i9-7980XE @ 36x 4.2GHz [37.0°C]
 ████████  ████████  ████████     GPU: NVIDIA GeForce RTX 4090, NVIDIA GeForce GTX 1080 Ti
                                  RAM: 33517MiB / 128494MiB
Author
Owner

@s00x3r commented on GitHub (Apr 18, 2024):

Same error:

Error: llama runner process no longer running: 1 error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-cfcf93119280c4a10c1df57335bad341e000cabbc4faff125531d941a5b0befa'

Author
Owner

@MarkWard0110 commented on GitHub (Apr 18, 2024):

I'm getting the same error with the new models.

Intel Core i9 14900k
DDR5 6400 2x48GB (96GB)
Nvidia RTX 4070 TI Super 16GB

Apr 18 18:57:54 quorra ollama[1170]: time=2024-04-18T18:57:54.713Z level=INFO source=routes.go:97 msg="changing loaded model"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.039Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.039Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.128Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.141Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.141Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.174Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.186Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=8 layers=8 required="71518.7 MiB" used="14828.9 MiB" available="15857.2 MiB" kv="320.0 MiB" fulloffload="320.0 MiB" partialoffload="320.0 MiB"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.186Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.187Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1615772994/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 8 --port 41305"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.187Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 18 18:57:55 quorra ollama[20191]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140044021682176","timestamp":1713466675}
Apr 18 18:57:55 quorra ollama[20191]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140044021682176","timestamp":1713466675}
Apr 18 18:57:55 quorra ollama[20191]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140044021682176","timestamp":1713466675,"total_threads":32}
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: loaded meta data with 24 key-value pairs and 323 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 (version GGUF V3 (latest))
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   0:                       general.architecture str              = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   1:                               general.name str              = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   2:                           dbrx.block_count u32              = 40
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   3:                        dbrx.context_length u32              = 32768
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   4:                      dbrx.embedding_length u32              = 6144
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   5:                   dbrx.feed_forward_length u32              = 10752
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   6:                  dbrx.attention.head_count u32              = 48
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   7:               dbrx.attention.head_count_kv u32              = 8
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   8:                        dbrx.rope.freq_base f32              = 500000.000000
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   9:                   dbrx.attention.clamp_kqv f32              = 8.000000
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  11:                          dbrx.expert_count u32              = 16
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  12:                     dbrx.expert_used_count u32              = 4
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  13:          dbrx.attention.layer_norm_epsilon f32              = 0.000010
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 100257
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 100257
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 100257
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 100277
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type  f32:   81 tensors
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type  f16:   40 tensors
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type q4_0:  201 tensors
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type q6_K:    1 tensors
Apr 18 18:57:55 quorra ollama[1170]: llm_load_vocab: special tokens definition check successful ( 96/100352 ).
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: format           = GGUF V3 (latest)
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: arch             = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: vocab type       = BPE
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_vocab          = 100352
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_merges         = 100000
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_ctx_train      = 32768
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd           = 6144
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_head           = 48
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_head_kv        = 8
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_layer          = 40
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_rot            = 128
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_head_k    = 128
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_head_v    = 128
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_gqa            = 6
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_k_gqa     = 1024
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_v_gqa     = 1024
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_clamp_kqv      = 8.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_ff             = 10752
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_expert         = 16
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_expert_used    = 4
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: causal attn      = 1
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: pooling type     = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope type        = 2
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope scaling     = linear
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: freq_base_train  = 500000.0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: freq_scale_train = 1
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope_finetuned   = unknown
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_conv       = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_inner      = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_state      = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model type       = 16x12B
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model ftype      = Q4_0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model params     = 131.60 B
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model size       = 69.09 GiB (4.51 BPW)
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: general.name     = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: BOS token        = 100257 '<|endoftext|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: EOS token        = 100257 '<|endoftext|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: UNK token        = 100257 '<|endoftext|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: PAD token        = 100277 '<|pad|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: LF token         = 128 'Ä'
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: found 1 CUDA devices:
Apr 18 18:57:55 quorra ollama[1170]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 18 18:57:55 quorra ollama[1170]: llm_load_tensors: ggml ctx size =    0.74 MiB
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: offloading 8 repeating layers to GPU
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: offloaded 8/41 layers to GPU
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors:        CPU buffer size = 70752.49 MiB
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors:      CUDA0 buffer size = 13987.88 MiB
Apr 18 18:58:19 quorra ollama[1170]: ....................................................................................................
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_ctx      = 2048
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_batch    = 512
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_ubatch   = 512
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: freq_base  = 500000.0
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: freq_scale = 1
Apr 18 18:58:19 quorra ollama[1170]: llama_kv_cache_init:  CUDA_Host KV buffer size =   256.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_kv_cache_init:      CUDA0 KV buffer size =    64.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.41 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model:      CUDA0 compute buffer size =  1794.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: graph nodes  = 2886
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: graph splits = 325
Apr 18 18:58:20 quorra ollama[1170]: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
Apr 18 18:58:20 quorra ollama[1170]:   current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526
Apr 18 18:58:20 quorra ollama[1170]:   cublasCreate_v2(&cublas_handles[device])
Apr 18 18:58:20 quorra ollama[1170]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Apr 18 18:58:20 quorra ollama[1170]: time=2024-04-18T18:58:20.740Z level=ERROR source=routes.go:120 msg="error loading llama server" error="llama runner process no longer running: -1 CUDA error: CUBLAS_STATUS_NOT_INITIALIZED\n  current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526\n  cublasCreate_v2(&cublas_handles[device])\nGGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !\"CUDA error\""
Apr 18 18:58:20 quorra ollama[1170]: [GIN] 2024/04/18 - 18:58:20 | 500 | 26.029470037s |      10.0.0.123 | POST     "/api/generate"
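A note on the CUBLAS_STATUS_NOT_INITIALIZED assert: cublasCreate allocates device memory for its handle, and it commonly fails this way when the GPU is already effectively out of memory, so this is likely the same OOM wearing a different error. A quick triage sketch to see what is holding VRAM when the runner starts:

```sh
# Per-GPU memory totals, then the processes currently holding VRAM
# (standard nvidia-smi query flags).
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```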
cublasCreate_v2(&cublas_handles[device])\nGGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !\"CUDA error\"" Apr 18 18:58:20 quorra ollama[1170]: [GIN] 2024/04/18 - 18:58:20 | 500 | 26.029470037s | 10.0.0.123 | POST "/api/generate" ```
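
For what it's worth, the CUDA0 buffers reported in this log very nearly exhaust the available VRAM on their own, which would explain why cuBLAS subsequently fails to allocate its handle. A rough tally in Python, assuming the reported buffers simply add up and nothing else is resident on the card:

```python
# Rough tally of the CUDA0 allocations in the log above; all figures in MiB.
weights   = 13987.88  # llm_load_tensors: CUDA0 buffer size (8 layers)
kv        = 64.00     # llama_kv_cache_init: CUDA0 KV buffer size
compute   = 1794.00   # llama_new_context_with_model: CUDA0 compute buffer size
available = 15857.2   # "offload to gpu" available VRAM

used = weights + kv + compute
print(f"used = {used:.2f} MiB, headroom = {available - used:.2f} MiB")
# used = 15845.88 MiB, headroom = 11.32 MiB -- plausibly too little for
# cuBLAS to initialize its workspace, hence CUBLAS_STATUS_NOT_INITIALIZED.
```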

@KolioM commented on GitHub (Apr 19, 2024):

I have the exact same issue on both Windows and Linux using two 3090s, latest drivers...


@s00x3r commented on GitHub (Apr 19, 2024):

Not running in a virtual machine, and the CPU has the required flags, per /proc/cpuinfo:
flags: avx avx2


@KolioM commented on GitHub (Apr 19, 2024):

Btw, llama 3 70b runs smoothly; it is only the Mixtral 8x22b models that fail.


@s00x3r commented on GitHub (Apr 21, 2024):

FOUND THE PROBLEM!
When ollama starts the 8x22b model with the command below, it works. Pay attention: the only difference is the --n-gpu-layers parameter.
Working (it also works with --n-gpu-layers 25, etc.):
ollama[547518]: time=2024-04-21T10:20:54.906+03:00 level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1719319998/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-cfcf93119280c4a10c1df57335bad341e000cabbc4faff125531d941a5b0befa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 23 --port 4123"

Not working:
ollama[547518]: time=2024-04-21T10:11:26.056+03:00 level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1719319998/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-cfcf93119280c4a10c1df57335bad341e000cabbc4faff125531d941a5b0befa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --port 30935"
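
This trial-and-error can be scripted against the server's HTTP API instead of restarting by hand — a minimal Python sketch, assuming a local ollama on the default port 11434 and that the request-level `num_gpu` option controls the same layer count as `--n-gpu-layers`; the starting value of 33 is just the failing estimate from the log above, and each failed attempt can take tens of seconds:

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "mixtral:8x22b"  # placeholder: whichever model fails to load

def try_load(num_gpu: int) -> bool:
    """Ask the server to (re)load MODEL with an explicit layer count."""
    body = json.dumps({
        "model": MODEL,
        "prompt": "hi",
        "stream": False,
        "options": {"num_gpu": num_gpu},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=600).read()
        return True
    except urllib.error.HTTPError:
        return False  # the server answers 500 when the runner crashes

# Walk down from ollama's own (failing) estimate until the model loads.
for n in range(33, 0, -1):
    if try_load(n):
        print(f"largest working num_gpu: {n}")
        break
```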


@mz2 commented on GitHub (Apr 21, 2024):

Nice, same story here. In my case the command that induces the cuda malloc failure with mixtral:8x22b is, as extracted from the log:

```bash
/tmp/ollama346288213/runners/cuda_v12/ollama_llama_server --model /media/data/ollama/blobs/sha256-7ec0c94a95cafef2780d00679e83f172ac343bc828aebbe2a5475fbe2daf76ff --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 51 --port 43359
```

The same command runs happily if I reduce the GPU offload by just one layer: `--n-gpu-layers 50` works. Deducting 1 from ollama's estimate of the number of layers is also enough to get dbrx and wizardlm2 running on my dual-GPU setup with `cuda_v12` -- I confirmed this with a local build of ollama where I did this:

```golang
	if opts.NumGPU >= 0 {
		params = append(params, "--n-gpu-layers", fmt.Sprintf("%d", opts.NumGPU-1)) // lol
	}
```

(Also, unsurprisingly, the CPU-based library is not affected: `OLLAMA_LLM_LIBRARY="cpu_avx2" ollama run mixtral:8x22b` also works around the issue.)

@one-bit commented on GitHub (Apr 21, 2024):

In my case, I was able to work around the issue by creating a custom `Modelfile` with a custom value for the `num_gpu` parameter, to limit the number of layers that can fit into my GPU's VRAM. I started with a value of 10 and increased it until it crashed again, then I kept the last value that worked (in my case 18).

Here's my custom `Modelfile`:

```
FROM wizardlm2:8x22b

PARAMETER num_gpu 18
```

Then I created a new model entry in `ollama` using this `Modelfile`:

```
ollama create wizardlm-fix -f Modelfile
```

And finally I was able to load the model using:

```
ollama run wizardlm-fix
```

Please note that this does not fix the issue _per se_, but it may be a temporary workaround if you really want to run or test these models.
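
If you would rather not create a separate model entry, the same `num_gpu` value can also be passed per request through the API's `options` field — a minimal Python sketch, assuming a local server on the default port 11434; the model name and layer count are simply the values from the workaround above:

```python
import json
import urllib.request

# Per-request override: intended to have the same effect as
# `PARAMETER num_gpu 18` in a Modelfile, without a new model entry.
body = json.dumps({
    "model": "wizardlm2:8x22b",
    "prompt": "Say hello.",
    "stream": False,
    "options": {"num_gpu": 18},
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
print(json.load(urllib.request.urlopen(req))["response"])
```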

@nkeilar commented on GitHub (Apr 24, 2024):

This is what worked for me with a dual 3090 setup and a 14900K. Note that I limited the context to something quite short as a proof of concept. Response speed was 8.6 tok/sec.

```
FROM wizardlm2:8x22b-q2_K
TEMPLATE """{{ if .System }}{{ .System }} {{ end }}{{ if .Prompt }}USER: {{ .Prompt }} {{ end }}ASSISTANT: {{ .Response }}"""
SYSTEM """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."""
PARAMETER stop "USER:"
PARAMETER stop "ASSISTANT:"
PARAMETER num_ctx 2000
PARAMETER num_gpu 40
```
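
Capping `num_ctx` plausibly helps here because the KV cache grows linearly with the context length, freeing VRAM for more layers. A quick sanity check of that scaling against the figures in the dbrx log earlier in the thread (f16 cache, so 2 bytes per element; K and V are each n_ctx × n_layer × n_embd_k_gqa elements):

```python
# KV-cache size check against the dbrx log above.
n_ctx, n_layer, n_embd_kv = 2048, 40, 1024  # from the llama.cpp metadata

k_cache = n_ctx * n_layer * n_embd_kv * 2   # bytes for the K cache (f16)
print(k_cache / 2**20)       # 160.0 -> matches "K (f16): 160.00 MiB"
print(2 * k_cache / 2**20)   # 320.0 -> matches "KV self size = 320.00 MiB"
```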

@plp38 commented on GitHub (May 23, 2024):

> In my case, I was able to work around the issue by creating a custom `Modelfile` with a custom value for the `num_gpu` parameter, to limit the number of layers that can fit into my GPU's VRAM. I started with a value of 10 and increased it until it crashed again, then I kept the last value that worked (in my case 18).
>
> Here's my custom `Modelfile`:
>
> ```
> FROM wizardlm2:8x22b
>
> PARAMETER num_gpu 18
> ```
>
> Then I created a new model entry in `ollama` using this `Modelfile`:
>
> ```
> ollama create wizardlm-fix -f Modelfile
> ```
>
> And finally I was able to load the model using:
>
> ```
> ollama run wizardlm-fix
> ```
>
> Please note that this does not fix the issue _per se_, but it may be a temporary workaround if you really want to run/test these models.

Thank you, the `num_gpu` parameter solved my issue with my GPU. However, what does it mean to limit the number of layers, and what is the impact? Sorry for the stupid question, but are the other layers loaded in RAM?

@one-bit commented on GitHub (May 28, 2024):

> > In my case, I was able to work around the issue by creating a custom `Modelfile` with a custom value for the `num_gpu` parameter, to limit the number of layers that can fit into my GPU's VRAM. I started with a value of 10 and increased it until it crashed again, then I kept the last value that worked (in my case 18).
> >
> > Here's my custom `Modelfile`:
> >
> > ```
> > FROM wizardlm2:8x22b
> >
> > PARAMETER num_gpu 18
> > ```
> >
> > Then I created a new model entry in `ollama` using this `Modelfile`:
> >
> > ```
> > ollama create wizardlm-fix -f Modelfile
> > ```
> >
> > And finally I was able to load the model using:
> >
> > ```
> > ollama run wizardlm-fix
> > ```
> >
> > Please note that this does not fix the issue _per se_, but it may be a temporary workaround if you really want to run/test these models.
>
> Thank you, the `num_gpu` parameter solved my issue with my GPU. However, what does it mean to limit the number of layers, and what is the impact? Sorry for the stupid question, but are the other layers loaded in RAM?

This parameter limits the number of model layers that are loaded into the GPU's VRAM (and processed by the GPU); the remaining layers are loaded into the computer's RAM (and processed by the CPU).
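
As a rough rule of thumb for choosing a starting `num_gpu`, divide the VRAM you can spare by the per-layer weight size — a back-of-envelope Python sketch using the dbrx figures from the log earlier in this thread (the 2 GiB overhead allowance for KV cache, compute buffers, and cuBLAS workspace is a guess, not a measured number):

```python
# Back-of-envelope num_gpu estimate; all figures in MiB.
model_size = 69.09 * 1024  # llm_load_print_meta: model size = 69.09 GiB
n_layers   = 40            # dbrx.block_count
vram       = 15857.2       # "available" VRAM reported by ollama
overhead   = 2048          # guess: KV cache + compute buffers + cuBLAS

per_layer = model_size / n_layers             # ~1769 MiB per layer
num_gpu = int((vram - overhead) // per_layer)
print(f"~{per_layer:.0f} MiB/layer -> try num_gpu = {num_gpu}")
# -> try num_gpu = 7, close to the 8 layers ollama itself estimated above
```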

@dhiltgen commented on GitHub (Jun 1, 2024):

Can you see if the latest release has improved the situation on the larger mixtral models?


@dhiltgen commented on GitHub (Jun 22, 2024):

The latest release (0.1.45) has improvements around multi-GPU prediction and layer splits which should help these situations. If you're still seeing OOMs after upgrading, please share an updated server log and I'll reopen the issue.

Reference: github-starred/ollama#64322