[GH-ISSUE #5136] deepseek v2 memory prediction incorrect - "CUBLAS_STATUS_NOT_INITIALIZED" error or out-of-memory #49749

Open
opened 2026-04-28 12:51:42 -05:00 by GiteaMirror · 13 comments
Owner

Originally created by @tincore on GitHub (Jun 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5136

Originally assigned to: @mxyng on GitHub.

What is the issue?

Hi,

I noticed the previous out-of-memory error was fixed in version 0.1.45-rc3 (https://github.com/ollama/ollama/issues/5113).

ollama run deepseek-coder-v2

Now I'm getting a CUDA error: "CUBLAS_STATUS_NOT_INITIALIZED".

Other models are running fine.

jun 19 08:50:18 ollama[5814]: [GIN] 2024/06/19 - 08:50:18 | 200 |      27.095µs |       127.0.0.1 | HEAD     "/"
jun 19 08:50:18 ollama[5814]: [GIN] 2024/06/19 - 08:50:18 | 200 |    1.703694ms |       127.0.0.1 | POST     "/api/show"
jun 19 08:50:18 ollama[5814]: [GIN] 2024/06/19 - 08:50:18 | 200 |    1.297062ms |       127.0.0.1 | POST     "/api/show"
jun 19 08:50:19 ollama[5814]: time=2024-06-19T08:50:19.042+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=28 layers.offload=21 layers.split="" memory.available="[7.4 GiB]" memory.required.full="9.5 GiB" memory.required.partial="7.4 GiB" memory.required.kv="432.0 MiB" memory.required.allocations="[7.4 GiB]" memory.weights.total="8.4 GiB" memory.weights.repeating="8.3 GiB" memory.weights.nonrepeating="164.1 MiB" memory.graph.full="212.0 MiB" memory.graph.partial="376.1 MiB"
jun 19 08:50:19 ollama[5814]: time=2024-06-19T08:50:19.042+02:00 level=INFO source=server.go:359 msg="starting llama server" cmd="/tmp/ollama2297022718/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 21 --flash-attn --parallel 1 --port 33155"
jun 19 08:50:19 ollama[5814]: time=2024-06-19T08:50:19.043+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1
jun 19 08:50:19 ollama[5814]: time=2024-06-19T08:50:19.043+02:00 level=INFO source=server.go:547 msg="waiting for llama runner to start responding"
jun 19 08:50:19 ollama[5814]: time=2024-06-19T08:50:19.043+02:00 level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server error"
jun 19 08:50:19 ollama[7938]: INFO [main] build info | build=1 commit="7c26775" tid="123888687239168" timestamp=1718779819
jun 19 08:50:19 ollama[7938]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="123888687239168" timestamp=1718779819 total_threads=16
jun 19 08:50:19 ollama[7938]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="33155" tid="123888687239168" timestamp=1718779819
jun 19 08:50:19 ollama[5814]: llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 (version GGUF V3 (latest))
jun 19 08:50:19 ollama[5814]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   1:                               general.name str              = DeepSeek-Coder-V2-Lite-Instruct
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
jun 19 08:50:19 ollama[5814]: llama_model_loader: - kv  37:               general.quantization_version u32              = 2
jun 19 08:50:19 ollama[5814]: llama_model_loader: - type  f32:  108 tensors
jun 19 08:50:19 ollama[5814]: llama_model_loader: - type q4_0:  268 tensors
jun 19 08:50:19 ollama[5814]: llama_model_loader: - type q6_K:    1 tensors
jun 19 08:50:19 ollama[5814]: llm_load_vocab: special tokens cache size = 2400
jun 19 08:50:19 ollama[5814]: llm_load_vocab: token to piece cache size = 0.6661 MB
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: format           = GGUF V3 (latest)
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: arch             = deepseek2
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: vocab type       = BPE
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_vocab          = 102400
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_merges         = 99757
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_ctx_train      = 163840
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_embd           = 2048
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_head           = 16
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_head_kv        = 16
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_layer          = 27
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_rot            = 64
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_embd_head_k    = 192
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_embd_head_v    = 128
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_gqa            = 1
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_embd_k_gqa     = 3072
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_embd_v_gqa     = 2048
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: f_norm_eps       = 0.0e+00
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: f_logit_scale    = 0.0e+00
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_ff             = 10944
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_expert         = 64
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_expert_used    = 6
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: causal attn      = 1
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: pooling type     = 0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: rope type        = 0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: rope scaling     = yarn
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: freq_base_train  = 10000.0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: freq_scale_train = 0.025
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_ctx_orig_yarn  = 4096
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: rope_finetuned   = unknown
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: ssm_d_conv       = 0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: ssm_d_inner      = 0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: ssm_d_state      = 0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: ssm_dt_rank      = 0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: model type       = 16B
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: model ftype      = Q4_0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: model params     = 15.71 B
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: model size       = 8.29 GiB (4.53 BPW)
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: general.name     = DeepSeek-Coder-V2-Lite-Instruct
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: LF token         = 126 'Ä'
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_layer_dense_lead   = 1
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_lora_q             = 0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_lora_kv            = 512
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_ff_exp             = 1408
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: n_expert_shared      = 2
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: expert_weights_scale = 1.0
jun 19 08:50:19 ollama[5814]: llm_load_print_meta: rope_yarn_log_mul    = 0.0707
jun 19 08:50:19 ollama[5814]: time=2024-06-19T08:50:19.294+02:00 level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server loading model"
jun 19 08:50:19 ollama[5814]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
jun 19 08:50:19 ollama[5814]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
jun 19 08:50:19 ollama[5814]: ggml_cuda_init: found 1 CUDA devices:
jun 19 08:50:19 ollama[5814]:   Device 0: NVIDIA GeForce RTX 3070 Laptop GPU, compute capability 8.6, VMM: yes
jun 19 08:50:19 ollama[5814]: llm_load_tensors: ggml ctx size =    0.35 MiB
jun 19 08:50:20 ollama[5814]: llm_load_tensors: offloading 21 repeating layers to GPU
jun 19 08:50:20 ollama[5814]: llm_load_tensors: offloaded 21/28 layers to GPU
jun 19 08:50:20 ollama[5814]: llm_load_tensors:        CPU buffer size =  2222.30 MiB
jun 19 08:50:20 ollama[5814]: llm_load_tensors:      CUDA0 buffer size =  6597.82 MiB
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: n_ctx      = 2048
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: n_batch    = 512
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: n_ubatch   = 512
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: flash_attn = 0
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: freq_base  = 10000.0
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: freq_scale = 0.025
jun 19 08:50:21 ollama[5814]: llama_kv_cache_init:  CUDA_Host KV buffer size =   120.00 MiB
jun 19 08:50:21 ollama[5814]: llama_kv_cache_init:      CUDA0 KV buffer size =   420.00 MiB
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: KV self size  =  540.00 MiB, K (f16):  324.00 MiB, V (f16):  216.00 MiB
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.40 MiB
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model:      CUDA0 compute buffer size =   376.06 MiB
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: graph nodes  = 1924
jun 19 08:50:21 ollama[5814]: llama_new_context_with_model: graph splits = 96
jun 19 08:50:22 ollama[5814]: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
jun 19 08:50:22 ollama[5814]:   current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:826
jun 19 08:50:22 ollama[5814]:   cublasCreate_v2(&cublas_handles[device])
jun 19 08:50:22 ollama[5814]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:100: !"CUDA error"
jun 19 08:50:22 ollama[7965]: Could not attach to process.  If your uid matches the uid of the target
jun 19 08:50:22 ollama[7965]: process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
jun 19 08:50:22 ollama[7965]: again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
jun 19 08:50:22 ollama[5814]: ptrace: Inappropriate ioctl for device.
jun 19 08:50:22 ollama[5814]: No stack.
jun 19 08:50:22 ollama[5814]: The program is not being run.
jun 19 08:50:22 ollama[5814]: time=2024-06-19T08:50:22.608+02:00 level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server error"
jun 19 08:50:22 ollama[5814]: time=2024-06-19T08:50:22.858+02:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) CUDA error: CUBLAS_STATUS_NOT_INITIALIZED\n  current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:826\n  cublasCreate_v2(&cublas_handles[device])\nGGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:100: !\"CUDA error\""
jun 19 08:50:22 ollama[5814]: [GIN] 2024/06/19 - 08:50:22 | 500 |  4.532340316s |       127.0.0.1 | POST     "/api/chat"
jun 19 08:50:28 ollama[5814]: time=2024-06-19T08:50:28.078+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.219301918 model=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046
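
(The "Could not attach to process" / ptrace lines above just mean the assert handler couldn't capture a backtrace. If a stack trace would help with debugging, the Yama restriction the log points at can usually be relaxed temporarily — this is a standard kernel sysctl, not anything ollama-specific:)

sudo sysctl -w kernel.yama.ptrace_scope=0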

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.04              Driver Version: 555.52.04      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   60C    P0             35W /  115W |     168MiB /   8192MiB |     32%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
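
For what it's worth, adding up the CUDA0 allocations reported in the log above:

6597.82 MiB (weights) + 420.00 MiB (KV cache) + 376.06 MiB (compute) = 7393.88 MiB

against memory.available = 7.4 GiB (~7578 MiB), that leaves only ~184 MiB of headroom, so cublasCreate failing while allocating its workspace on top of that seems plausible (the workspace size is my assumption; the log doesn't report it).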

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.45-rc3

GiteaMirror added the memory, bug, nvidia labels 2026-04-28 12:51:42 -05:00
Author
Owner

@tincore commented on GitHub (Jun 19, 2024):

Additionally.

DeepSeek V2 fails with the same error: deepseek-v2 (https://ollama.com/library/deepseek-v2).

So I guess this affects the whole MoE family.

Author
Owner

@dhiltgen commented on GitHub (Jun 19, 2024):

I wasn't able to repro this failure on 0.1.45-rc3 on an RTX 4090 or 3060 running driver 550.

Are you setting any other env vars to tune the system setup? If not, maybe it's a 555 quirk?

Author
Owner

@tincore commented on GitHub (Jun 19, 2024):

Hi,

I removed the only one I was using (OLLAMA_FLASH_ATTENTION) before submitting the bug.
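
(For context, on this systemd install that variable is set and removed through the usual service override — roughly the following, assuming the standard Linux install:)

sudo systemctl edit ollama.service
# override contents:
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
# apply:
sudo systemctl daemon-reload && sudo systemctl restart ollama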

I've just tried on another computer running basically the same OS, also a Ryzen, this time paired with an NVIDIA 4060 Ti 16 GB, and the model works there. Maybe it is related to the amount of memory; that computer was also hitting the OOM present in the current stable version.

+-----------------------------------------------------------------------------------------+                                                
| NVIDIA-SMI 555.52.04              Driver Version: 555.52.04      CUDA Version: 12.5     |                                                
|-----------------------------------------+------------------------+----------------------+                                                
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |                                                
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |                                                
|                                         |                        |               MIG M. |                                                
|=========================================+========================+======================|                                                
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:2B:00.0 Off |                  N/A |                                                
|  0%   38C    P8              6W /  165W |     138MiB /  16380MiB |      0%      Default |                                                
|                                         |                        |                  N/A |                                                
+-----------------------------------------+------------------------+----------------------+                                                
                                                                                              
Author
Owner

@tincore commented on GitHub (Jun 19, 2024):

Probably not the most popular model, especially for people like me who are not proficient in Chinese (I just found that out) :)

Author
Owner

@dhiltgen commented on GitHub (Jun 19, 2024):

I'm curious whether it works better if you force num_gpu to one less than it's loading. Maybe this is a subtle OOM failure mode?
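
(For reference, one way to try that without editing anything on disk — assuming this build's interactive /set command accepts the parameter — is:)

ollama run deepseek-coder-v2
>>> /set parameter num_gpu 20

(or pass "options": {"num_gpu": 20} in an /api/chat request.)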

Author
Owner

@binaryc0de commented on GitHub (Jun 20, 2024):

I'm getting the same issue with this model:

jason@jason-LOQ-15APH8:~$ ollama run deepseek-coder-v2
Error: llama runner process has terminated: signal: aborted (core dumped) CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:653
cublasCreate_v2(&cublas_handles[device])
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:100: !"CUDA error"

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4050 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   34C    P5               7W /  60W |    581MiB /  6141MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

OS
Linux

GPU
Nvidia

CPU
No response

Ollama version
0.1.44

Author
Owner

@tincore commented on GitHub (Jun 20, 2024):

I tried with the following Modelfile (I could not find any other way to do this in the documentation) and the model now loads.

FROM deepseek-coder-v2:latest
PARAMETER num_gpu 20
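
(Built and loaded with the usual commands — the model name here is just the one I picked:)

ollama create deepseek-coder-v2-gpu20 -f Modelfile
ollama run deepseek-coder-v2-gpu20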

Log below; you can indeed see 20 layers loaded there. Additionally, it's complaining about the chat template, but I guess that is a totally different issue (I saw some tags mixed into some of the chat responses).


jun 20 09:09:24 ollama[1558]: time=2024-06-20T09:09:24.193+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=20 layers.model=28 layers.offload=20 layers.split="" memory.available="[7.4 GiB]" memory.required.full="9.3 GiB" memory.required.partial="7.1 GiB" memory.required.kv="432.0 MiB" memory.required.allocations="[7.1 GiB]" memory.weights.total="8.4 GiB" memory.weights.repeating="8.3 GiB" memory.weights.nonrepeating="164.1 MiB" memory.graph.full="212.0 MiB" memory.graph.partial="376.1 MiB"
jun 20 09:09:24 ollama[1558]: time=2024-06-20T09:09:24.194+02:00 level=INFO source=server.go:359 msg="starting llama server" cmd="/tmp/ollama3276930808/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 20 --flash-attn --parallel 1 --port 42631"
jun 20 09:09:24 ollama[1558]: time=2024-06-20T09:09:24.194+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1
jun 20 09:09:24 ollama[1558]: time=2024-06-20T09:09:24.194+02:00 level=INFO source=server.go:547 msg="waiting for llama runner to start responding"
jun 20 09:09:24 ollama[1558]: time=2024-06-20T09:09:24.194+02:00 level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server error"
jun 20 09:09:24 ollama[15933]: INFO [main] build info | build=1 commit="7c26775" tid="137974216376320" timestamp=1718867364                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[15933]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="137974216376320" timestamp=1718867364 total_threads=16                                                                                                                                                                                                                                                                                                                                                                                                                      
jun 20 09:09:24 ollama[15933]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="42631" tid="137974216376320" timestamp=1718867364                                                                                                                                                                                                                                                        
jun 20 09:09:24 ollama[1558]: llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 (version GGUF V3 (latest))                                                                                                                                                                  
jun 20 09:09:24 ollama[1558]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.                                                                                                                                                                                                                                                                                            
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   0:                       general.architecture str              = deepseek2                                                                                                                                                                                                                                                                                        
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   1:                               general.name str              = DeepSeek-Coder-V2-Lite-Instruct                                                                                                                                                                                                                                                                  
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27                                                                                                                                                                                                                                                                                               
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840                                                                                                                                                                                                                                                                                           
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048                                                                                                                                                                                                                                                                                             
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944                                                                                                                                                                                                                                                                                            
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16                                                                                                                                                                                                                                                                                               
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16                                                                                                                                                                                                                                                                                               
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000                                                                                                                                                                                                                                                                                     
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001                                                                                                                                                                                                                                                                                         
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  11:                          general.file_type u32              = 2                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400                                                                                                                                                                                                                                                                                           
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512                                                                                                                                                                                                                                                                                              
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192                                                                                                                                                                                                                                                                                              
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128                                                                                                                                                                                                                                                                                              
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408                                                                                                                                                                                                                                                                                             
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64                                                                                                                                                                                                                                                                                               
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000                                                                                                                                                                                                                                                                                         
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64                                                                                                                                                                                                                                                                                               
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn                                                                                                                                                                                                                                                                                             
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000                                                                                                                                                                                                                                                                                        
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096                                                                                                                                                                                                                                                                                         
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700                                                                                                                                                                                                                                                                                         
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2                                                                                                                                                                                                                                                                                             
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm                                                                                                                                                                                                                                                                                     
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...                                                                                                                                                                                                                                                         
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...                                                                                                                                                                                                                                                         
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...                                                                                                                                                                                                                                                             
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000                                                                                                                                                                                                                                                                                           
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001                                                                                                                                                                                                                                                                                           
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001                                                                                                                                                                                                                                                                                           
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true                                                                                                                                                                                                                                                                                             
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false                                                                                                                                                                                                                                                                                            
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...                                                                                                                                                                                                                                                         
jun 20 09:09:24 ollama[1558]: llama_model_loader: - kv  37:               general.quantization_version u32              = 2                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llama_model_loader: - type  f32:  108 tensors                                                                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llama_model_loader: - type q4_0:  268 tensors                                                                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llama_model_loader: - type q6_K:    1 tensors                                                                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llm_load_vocab: special tokens cache size = 2400                                                                                                                                                                                                                                                                                                                                             
jun 20 09:09:24 ollama[1558]: time=2024-06-20T09:09:24.445+02:00 level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server loading model"                                                                                                                                                                                                                                            
jun 20 09:09:24 ollama[1558]: llm_load_vocab: token to piece cache size = 0.6661 MB                                                                                                                                                                                                                                                                                                                                        
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: format           = GGUF V3 (latest)                                                                                                                                                                                                                                                                                                                                     
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: arch             = deepseek2                                                                                                                                                                                                                                                                                                                                            
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: vocab type       = BPE                                                                                                                                                                                                                                                                                                                                                  
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_vocab          = 102400                                                                                                                                                                                                                                                                                                                                               
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_merges         = 99757                                                                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_ctx_train      = 163840                                                                                                                                                                                                                                                                                                                                               
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_embd           = 2048                                                                                                                                                                                                                                                                                                                                                 
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_head           = 16                                                                                                                                                                                                                                                                                                                                                   
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_head_kv        = 16                                                                                                                                                                                                                                                                                                                                                   
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_layer          = 27                                                                                                                                                                                                                                                                                                                                                   
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_rot            = 64                                                                                                                                                                                                                                                                                                                                                   
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_embd_head_k    = 192                                                                                                                                                                                                                                                                                                                                                  
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_embd_head_v    = 128                                                                                                                                                                                                                                                                                                                                                  
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_gqa            = 1                                                                                                                                                                                                                                                                                                                                                    
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_embd_k_gqa     = 3072                                                                                                                                                                                                                                                                                                                                                 
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_embd_v_gqa     = 2048                                                                                                                                                                                                                                                                                                                                                 
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: f_norm_eps       = 0.0e+00                                                                                                                                                                                                                                                                                                                                              
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06                                                                                                                                                                                                                                                                                                                                              
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00                                                                                                                                                                                                                                                                                                                                              
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00                                                                                                                                                                                                                                                                                                                                              
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: f_logit_scale    = 0.0e+00                                                                                                                                                                                                                                                                                                                                              
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_ff             = 10944                                                                                                                                                                                                                                                                                                                                                
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_expert         = 64                                                                                                                                                                                                                                                                                                                                                   
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_expert_used    = 6                                                                                                                                                                                                                                                                                                                                                    
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: causal attn      = 1
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: pooling type     = 0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: rope type        = 0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: rope scaling     = yarn
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: freq_base_train  = 10000.0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: freq_scale_train = 0.025
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_ctx_orig_yarn  = 4096
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: rope_finetuned   = unknown
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: ssm_d_conv       = 0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: ssm_d_inner      = 0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: ssm_d_state      = 0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: ssm_dt_rank      = 0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: model type       = 16B
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: model ftype      = Q4_0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: model params     = 15.71 B
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: model size       = 8.29 GiB (4.53 BPW)
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: general.name     = DeepSeek-Coder-V2-Lite-Instruct
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: LF token         = 126 'Ä'
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_layer_dense_lead   = 1
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_lora_q             = 0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_lora_kv            = 512
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_ff_exp             = 1408
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: n_expert_shared      = 2
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: expert_weights_scale = 1.0
jun 20 09:09:24 ollama[1558]: llm_load_print_meta: rope_yarn_log_mul    = 0.0707
jun 20 09:09:24 ollama[1558]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
jun 20 09:09:24 ollama[1558]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
jun 20 09:09:24 ollama[1558]: ggml_cuda_init: found 1 CUDA devices:
jun 20 09:09:24 ollama[1558]:   Device 0: NVIDIA GeForce RTX 3070 Laptop GPU, compute capability 8.6, VMM: yes
jun 20 09:09:24 ollama[1558]: llm_load_tensors: ggml ctx size =    0.35 MiB
jun 20 09:09:25 ollama[1558]: llm_load_tensors: offloading 20 repeating layers to GPU
jun 20 09:09:25 ollama[1558]: llm_load_tensors: offloaded 20/28 layers to GPU
jun 20 09:09:25 ollama[1558]: llm_load_tensors:        CPU buffer size =  2222.30 MiB
jun 20 09:09:25 ollama[1558]: llm_load_tensors:      CUDA0 buffer size =  6283.64 MiB
jun 20 09:09:26 ollama[1558]: llama_new_context_with_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
jun 20 09:09:26 ollama[1558]: llama_new_context_with_model: n_ctx      = 2048
jun 20 09:09:26 ollama[1558]: llama_new_context_with_model: n_batch    = 512
jun 20 09:09:26 ollama[1558]: llama_new_context_with_model: n_ubatch   = 512
jun 20 09:09:26 ollama[1558]: llama_new_context_with_model: flash_attn = 0
jun 20 09:09:26 ollama[1558]: llama_new_context_with_model: freq_base  = 10000.0
jun 20 09:09:26 ollama[1558]: llama_new_context_with_model: freq_scale = 0.025
jun 20 09:09:27 ollama[1558]: llama_kv_cache_init:  CUDA_Host KV buffer size =   140.00 MiB
jun 20 09:09:27 ollama[1558]: llama_kv_cache_init:      CUDA0 KV buffer size =   400.00 MiB
jun 20 09:09:27 ollama[1558]: llama_new_context_with_model: KV self size  =  540.00 MiB, K (f16):  324.00 MiB, V (f16):  216.00 MiB
jun 20 09:09:27 ollama[1558]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.40 MiB
jun 20 09:09:27 ollama[1558]: llama_new_context_with_model:      CUDA0 compute buffer size =   376.06 MiB
jun 20 09:09:27 ollama[1558]: llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
jun 20 09:09:27 ollama[1558]: llama_new_context_with_model: graph nodes  = 1924
jun 20 09:09:27 ollama[1558]: llama_new_context_with_model: graph splits = 112
jun 20 09:09:27 ollama[15933]: INFO [main] model loaded | tid="137974216376320" timestamp=1718867367
jun 20 09:09:27 ollama[15933]: ERROR [validate_model_chat_template] The chat template comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses | tid="137974216376320" timestamp=1718867367
jun 20 09:09:27 ollama[1558]: time=2024-06-20T09:09:27.958+02:00 level=INFO source=server.go:590 msg="llama runner started in 3.76 seconds"
jun 20 09:09:29 ollama[1558]: [GIN] 2024/06/20 - 09:09:29 | 200 |  6.100401864s |       127.0.0.1 | POST     "/api/chat"

@dhiltgen commented on GitHub (Jun 20, 2024):

Looks like 0.1.45-rc4 is close, but still needs an additional fix to get the memory predictions right.

@dhiltgen commented on GitHub (Jun 20, 2024):

Fixed by #5192

@ProjectMoon commented on GitHub (Jun 22, 2024):

I am still getting out-of-memory errors with DeepSeek V2 at a context length of 8192, even though I have (or at least think I have) plenty of memory remaining. It happens when getting close to the context limit. Lowering `num_batch` from 512 to 256 helps. Is there an issue to track this somewhere?
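For anyone else hitting this, a minimal Modelfile sketch of that workaround (the `deepseek-v2:latest` tag and the derived model name are illustrative; `num_batch` appears to be accepted as a PARAMETER even though the Modelfile docs don't list it):

```
# Hypothetical Modelfile: keep the 8k context but halve the batch size.
# num_batch maps to the runner's --batch-size flag seen in the logs above.
FROM deepseek-v2:latest
PARAMETER num_ctx 8192
PARAMETER num_batch 256
```

Then something like `ollama create deepseek-v2-b256 -f Modelfile` followed by `ollama run deepseek-v2-b256`.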

@dhiltgen commented on GitHub (Jun 24, 2024):

@ProjectMoon can you share your server log showing the `inference compute` lines along with `offload to cuda` up to the point of crashing, so I can see what we predicted and how we crashed?
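For anyone collecting the same data: on a systemd install (which the journal-style log prefixes above suggest), something like the following captures the relevant lines; the unit name `ollama` is an assumption, adjust to your setup.

```
# Follow the ollama service journal, keeping the prediction and crash lines.
journalctl -u ollama -f | grep -E 'inference compute|offload to|CUDA error|GGML_ASSERT'
```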

@ProjectMoon commented on GitHub (Jun 26, 2024):

Here are the logs. This is using the model without forcing a specific number of GPU layers or disabling mmap. Context length is 8192 tokens. It's in a conversation that already holds a bunch of tokens (some large class definitions and such). It offloads 23 layers, then crashes after the runner starts. If I force the layers down to 20 with `num_gpu` and drop `num_batch` to 256, it runs fine, although somewhat slower, of course (a per-request sketch of that workaround follows the log below).

time=2024-06-26T09:40:19.540+02:00 level=INFO source=types.go:98 msg="inference compute" id=0 library=rocm compute=gfx1030 driver=0.0 name=1002:73bf total="16.0 GiB" available="16.0 GiB"
time=2024-06-26T09:40:30.940+02:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=28 layers.offload=23 layers.split="" memory.available="[16.0 GiB]" memory.required.full="18.6 GiB" memory.required.partial="15.8 GiB" memory.required.kv="4.2 GiB" memory.required.allocations="[15.8 GiB]" memory.weights.total="17.0 GiB" memory.weights.repeating="16.8 GiB" memory.weights.nonrepeating="164.1 MiB" memory.graph.full="568.0 MiB" memory.graph.partial="759.4 MiB"
time=2024-06-26T09:40:30.991+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama3787188362/runners/rocm_v60101/ollama_llama_server --model /ollama/blobs/sha256-9850aa096dffd1720aa351dd72b9e9f10d9a132a1b58cf7994f747137a936931 --ctx-size 16384 --batch-size 512 --embedding --log-disable --n-gpu-layers 23 --no-mmap --parallel 2 --port 44925"
time=2024-06-26T09:40:30.991+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1
time=2024-06-26T09:40:30.991+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding"
time=2024-06-26T09:40:30.992+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from /ollama/blobs/sha256-9850aa096dffd1720aa351dd72b9e9f10d9a132a1b58cf7994f747137a936931 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = DeepSeek-Coder-V2-Lite-Instruct
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 18
llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
time=2024-06-26T09:40:31.243+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type q8_0:   27 tensors
llama_model_loader: - type q6_K:  242 tensors
llm_load_vocab: special tokens cache size = 2400
llm_load_vocab: token to piece cache size = 0.6661 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 102400
llm_load_print_meta: n_merges         = 99757
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 27
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 10944
llm_load_print_meta: n_expert         = 64
llm_load_print_meta: n_expert_used    = 6
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 16B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 15.71 B
llm_load_print_meta: model size       = 13.10 GiB (7.16 BPW)
llm_load_print_meta: general.name     = DeepSeek-Coder-V2-Lite-Instruct
llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 126 'Ä'
llm_load_print_meta: n_layer_dense_lead   = 1
llm_load_print_meta: n_lora_q             = 0
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 1408
llm_load_print_meta: n_expert_shared      = 2
llm_load_print_meta: expert_weights_scale = 1.0
llm_load_print_meta: rope_yarn_log_mul    = 0.0707
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size =    0.35 MiB
time=2024-06-26T09:40:33.956+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
llm_load_tensors: offloading 23 repeating layers to GPU
llm_load_tensors: offloaded 23/28 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 11513.11 MiB
llm_load_tensors:  ROCm_Host buffer size =  1898.40 MiB
time=2024-06-26T09:40:34.209+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      ROCm0 KV buffer size =  3680.00 MiB
llama_kv_cache_init:  ROCm_Host KV buffer size =   640.00 MiB
llama_new_context_with_model: KV self size  = 4320.00 MiB, K (f16): 2592.00 MiB, V (f16): 1728.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.80 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   728.92 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    42.01 MiB
llama_new_context_with_model: graph nodes  = 1924
llama_new_context_with_model: graph splits = 64
time=2024-06-26T09:40:49.785+02:00 level=INFO source=server.go:599 msg="llama runner started in 18.79 seconds"
CUDA error: out of memory
  current device: 0, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:290
  ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:100: !"CUDA error"
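As referenced above, roughly the same workaround can also be applied per request instead of baking a new model, assuming the standard `/api/chat` options passthrough (the model tag is illustrative):

```
# Hypothetical per-request equivalent of num_gpu 20 / num_batch 256.
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-v2",
  "messages": [{"role": "user", "content": "hello"}],
  "options": {"num_gpu": 20, "num_batch": 256}
}'
```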
@ProjectMoon commented on GitHub (Jun 26, 2024):

Setting `num_batch` to 256 on the model without altering GPU layers also seems to help, allowing it to run even with 8k context, although I suspect that simply pushes the OOM error deeper into the conversation. With `num_batch` at 256, system RAM is stable at ~4 GB, and VRAM is of course used up with 23 layers offloaded.

Edit: yep, crashed one message later. However, I was able to regenerate the response, and it loaded the model and generated a response fine.
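One way to check whether context growth is what eventually exhausts VRAM is to watch the card while the conversation runs. On ROCm (as in the logs above), a sketch, assuming `rocm-smi` is available:

```
# Poll VRAM usage once a second while chatting (ROCm).
watch -n 1 rocm-smi --showmeminfo vram
```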
