[GH-ISSUE #5913] Out of memory when offloading layers on ROCm #3690

Open
opened 2026-04-12 14:30:23 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @oleid on GitHub (Jul 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5913

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I've noticed there are other recent issues on offloading; however, since I'm using a different setup, I thought opening a separate issue would make sense. I use neither Docker nor NVIDIA, but ROCm 6.1.3 on Debian stable.

Before each test, I stopped ollama (hoping that this would free any remaining GPU memory claims).
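ollama runs as a systemd service here (see the log below), so the stop/start cycle between tests is plain systemctl:

$ sudo systemctl stop ollama
$ sudo systemctl start ollama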

With ollama running I get:

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           63508        2349         753           8       61127       61158
Swap:              0           0           0

Also, nothing claims the GPU (I do run a desktop, but on the internal GPU, device 1):

$ rocm-smi  --device=0  --showmemuse  --showpids


======================= ROCm System Management Interface =======================
============================== Current Memory Use ==============================
GPU[0]          : GPU memory use (%): 0
GPU[0]          : Memory Activity: N/A
================================================================================
================================ KFD Processes =================================
No KFD PIDs currently running
================================================================================
============================= End of ROCm SMI Log ==============================

Running the same command right after hitting return in openwebui, I get:

$ rocm-smi  --device=0  --showmemuse  --showpids


======================= ROCm System Management Interface =======================
============================== Current Memory Use ==============================
GPU[0]          : GPU memory use (%): 2
GPU[0]          : Memory Activity: N/A
================================================================================
================================ KFD Processes =================================
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
115056  ollama_llama_se 1       25240158208     0               0
================================================================================
============================= End of ROCm SMI Log ==============================
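Note the scale here: the VRAM USED column is in bytes, so 25240158208 B / 2^30 ≈ 23.5 GiB, i.e. nearly the card's entire 24 GiB.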

[Screenshot "grafik": https://github.com/user-attachments/assets/3f91376a-54f8-458a-a24e-a30bd1d8c9a7]

Sample output of the error

[Output of ollama service]
Jul 24 16:07:04 vega systemd[1]: Stopping ollama.service - Ollama Service...
Jul 24 16:07:04 vega systemd[1]: ollama.service: Deactivated successfully.
Jul 24 16:07:04 vega systemd[1]: Stopped ollama.service - Ollama Service.
Jul 24 16:07:04 vega systemd[1]: Started ollama.service - Ollama Service.
Jul 24 16:07:04 vega ollama[114917]: 2024/07/24 16:07:04 routes.go:1100: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
Jul 24 16:07:04 vega ollama[114917]: time=2024-07-24T16:07:04.312+02:00 level=INFO source=images.go:784 msg="total blobs: 47"
Jul 24 16:07:04 vega ollama[114917]: time=2024-07-24T16:07:04.312+02:00 level=INFO source=images.go:791 msg="total unused blobs removed: 0"
Jul 24 16:07:04 vega ollama[114917]: time=2024-07-24T16:07:04.312+02:00 level=INFO source=routes.go:1147 msg="Listening on [::]:11434 (version 0.2.8)"
Jul 24 16:07:04 vega ollama[114917]: time=2024-07-24T16:07:04.313+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3590369896/runners
Jul 24 16:07:06 vega ollama[114917]: time=2024-07-24T16:07:06.996+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60102 cpu]"
Jul 24 16:07:06 vega ollama[114917]: time=2024-07-24T16:07:06.996+02:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
Jul 24 16:07:07 vega ollama[114917]: time=2024-07-24T16:07:07.004+02:00 level=INFO source=amd_linux.go:330 msg="amdgpu is supported" gpu=0 gpu_type=gfx1100
Jul 24 16:07:07 vega ollama[114917]: time=2024-07-24T16:07:07.004+02:00 level=INFO source=amd_linux.go:259 msg="unsupported Radeon iGPU detected skipping" id=1 total="512.0 MiB"
Jul 24 16:07:07 vega ollama[114917]: time=2024-07-24T16:07:07.004+02:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=rocm compute=gfx1100 driver=6.7 name=1002:744c total="24.0 GiB" available="24.0 GiB"
Jul 24 16:07:51 vega ollama[114917]: time=2024-07-24T16:07:51.812+02:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=81 layers.offload=48 layers.split="" memory.available="[24.0 GiB]" memory.required.full="39.3 GiB" memory.required.partial="23.9 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.9 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
Jul 24 16:07:51 vega ollama[114917]: time=2024-07-24T16:07:51.813+02:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3590369896/runners/rocm_v60102/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-0bd51f8f0c975ce910ed067dcb962a9af05b77bafcdc595ef02178387f10e51d --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 48 --parallel 1 --port 36639"
Jul 24 16:07:51 vega ollama[114917]: time=2024-07-24T16:07:51.813+02:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
Jul 24 16:07:51 vega ollama[114917]: time=2024-07-24T16:07:51.813+02:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
Jul 24 16:07:51 vega ollama[114917]: time=2024-07-24T16:07:51.813+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
Jul 24 16:07:51 vega ollama[115056]: INFO [main] build info | build=1 commit="d94c6e0" tid="140302582510400" timestamp=1721830071
Jul 24 16:07:51 vega ollama[115056]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140302582510400" timestamp=1721830071 total_threads=16
Jul 24 16:07:51 vega ollama[115056]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="36639" tid="140302582510400" timestamp=1721830071
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-0bd51f8f0c975ce910ed067dcb962a9af05b77bafcdc595ef02178387f10e51d (version GGUF V3 (latest))
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-70B-Instruct
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   2:                          llama.block_count u32              = 80
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 64
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - kv  21:               general.quantization_version u32              = 2
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - type  f32:  161 tensors
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - type q4_0:  561 tensors
Jul 24 16:07:51 vega ollama[114917]: llama_model_loader: - type q6_K:    1 tensors
Jul 24 16:07:52 vega ollama[114917]: time=2024-07-24T16:07:52.065+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
Jul 24 16:07:52 vega ollama[114917]: llm_load_vocab: special tokens cache size = 256
Jul 24 16:07:52 vega ollama[114917]: llm_load_vocab: token to piece cache size = 0.8000 MB
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: format           = GGUF V3 (latest)
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: arch             = llama
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: vocab type       = BPE
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_vocab          = 128256
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_merges         = 280147
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: vocab_only       = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_ctx_train      = 8192
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_embd           = 8192
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_layer          = 80
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_head           = 64
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_head_kv        = 8
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_rot            = 128
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_swa            = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_embd_head_k    = 128
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_embd_head_v    = 128
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_gqa            = 8
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_embd_k_gqa     = 1024
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_embd_v_gqa     = 1024
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_ff             = 28672
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_expert         = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_expert_used    = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: causal attn      = 1
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: pooling type     = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: rope type        = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: rope scaling     = linear
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: freq_base_train  = 500000.0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: freq_scale_train = 1
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: n_ctx_orig_yarn  = 8192
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: rope_finetuned   = unknown
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: ssm_d_conv       = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: ssm_d_inner      = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: ssm_d_state      = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: ssm_dt_rank      = 0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: model type       = 70B
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: model ftype      = Q4_0
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: model params     = 70.55 B
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: model size       = 37.22 GiB (4.53 BPW)
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: general.name     = Meta-Llama-3-70B-Instruct
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: LF token         = 128 'Ä'
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Jul 24 16:07:52 vega ollama[114917]: llm_load_print_meta: max token length = 256
Jul 24 16:07:53 vega ollama[114917]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Jul 24 16:07:53 vega ollama[114917]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Jul 24 16:07:53 vega ollama[114917]: ggml_cuda_init: found 1 ROCm devices:
Jul 24 16:07:53 vega ollama[114917]:   Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
Jul 24 16:07:53 vega ollama[114917]: llm_load_tensors: ggml ctx size =    0.68 MiB
Jul 24 16:07:55 vega ollama[114917]: llm_load_tensors: offloading 48 repeating layers to GPU
Jul 24 16:07:55 vega ollama[114917]: llm_load_tensors: offloaded 48/81 layers to GPU
Jul 24 16:07:55 vega ollama[114917]: llm_load_tensors:      ROCm0 buffer size = 22035.00 MiB
Jul 24 16:07:55 vega ollama[114917]: llm_load_tensors:        CPU buffer size = 38110.61 MiB
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model: n_ctx      = 2048
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model: n_batch    = 512
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model: n_ubatch   = 512
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model: flash_attn = 0
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model: freq_base  = 500000.0
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model: freq_scale = 1
Jul 24 16:07:57 vega ollama[114917]: llama_kv_cache_init:      ROCm0 KV buffer size =   384.00 MiB
Jul 24 16:07:57 vega ollama[114917]: llama_kv_cache_init:  ROCm_Host KV buffer size =   256.00 MiB
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model:  ROCm_Host  output buffer size =     0.52 MiB
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model:      ROCm0 compute buffer size =  1104.45 MiB
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model:  ROCm_Host compute buffer size =    20.01 MiB
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model: graph nodes  = 2566
Jul 24 16:07:57 vega ollama[114917]: llama_new_context_with_model: graph splits = 356
Jul 24 16:07:58 vega ollama[115056]: INFO [main] model loaded | tid="140302582510400" timestamp=1721830078
Jul 24 16:07:58 vega ollama[114917]: time=2024-07-24T16:07:58.350+02:00 level=INFO source=server.go:622 msg="llama runner started in 6.54 seconds"
Jul 24 16:07:58 vega ollama[114917]: CUDA error: out of memory
Jul 24 16:07:58 vega ollama[114917]:   current device: 0, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:291
Jul 24 16:07:58 vega ollama[114917]:   ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
Jul 24 16:07:58 vega ollama[114917]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: !"CUDA error"
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115057]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115058]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115059]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115060]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115061]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115062]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115063]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115064]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115065]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115066]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115067]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115068]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115069]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115070]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115071]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115072]
Jul 24 16:07:58 vega ollama[115138]: [New LWP 115083]
Jul 24 16:07:58 vega ollama[115138]: [Thread debugging using libthread_db enabled]
Jul 24 16:07:58 vega ollama[115138]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Jul 24 16:07:58 vega ollama[115138]: 0x00007f9abdcf2b57 in __GI___wait4 (pid=115138, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Jul 24 16:07:58 vega ollama[114917]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Jul 24 16:07:58 vega ollama[115138]: #0  0x00007f9abdcf2b57 in __GI___wait4 (pid=115138, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Jul 24 16:07:58 vega ollama[115138]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Jul 24 16:07:58 vega ollama[115138]: #1  0x00000000178d6c87 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
Jul 24 16:07:58 vega ollama[115138]: #2  0x00000000178ec50e in ggml_cuda_pool_leg::alloc(unsigned long, unsigned long*) ()
Jul 24 16:07:58 vega ollama[115138]: #3  0x00000000178ecc60 in ggml_cuda_pool_alloc<__half>::alloc(unsigned long) ()
Jul 24 16:07:58 vega ollama[115138]: #4  0x00000000178e2afb in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
Jul 24 16:07:58 vega ollama[115138]: #5  0x00000000178da448 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
Jul 24 16:07:58 vega ollama[115138]: #6  0x000000001789eac8 in ggml_backend_sched_graph_compute_async ()
Jul 24 16:07:58 vega ollama[115138]: #7  0x0000000017a85789 in llama_decode ()
Jul 24 16:07:58 vega ollama[115138]: #8  0x00000000177aac61 in llama_server_context::update_slots() ()
Jul 24 16:07:58 vega ollama[115138]: #9  0x00000000177ace2a in llama_server_queue::start_loop() ()
Jul 24 16:07:58 vega ollama[115138]: #10 0x000000001779073f in main ()
Jul 24 16:07:58 vega ollama[115138]: [Inferior 1 (process 115056) detached]
Jul 24 16:07:59 vega ollama[114917]: [GIN] 2024/07/24 - 16:07:59 | 200 |  7.978114451s | 2a02:b30:fac:1f00:7285:c2ff:fe0f:69f3 | POST     "/api/chat"

Any pointers on what might be wrong here?
Running smaller models works just fine.

OS

Linux 6.1.0-22-amd64 (debian stable)

GPU

AMD Radeon RX 7900 XTX (24 GiB VRAM)

CPU

AMD Ryzen 7 7700X

Ollama version

0.2.8

Model

llama3:70b-instruct-q4_0

GiteaMirror added the memory, bug labels 2026-04-12 14:30:23 -05:00
Author
Owner

@rick-github commented on GitHub (Jul 24, 2024):

No answers, just observations. Your card has 24 GiB and llama.cpp is allocating 23.9 GiB of it. There are two ways to mitigate the issue. The first is to reduce the number of layers that llama.cpp offloads to the card, either by passing "options": {"num_gpu": 46} in the API call, where 46 is the number of layers to offload (see "offloaded" in the logs; a lower number reduces memory pressure), or, since a UI usually doesn't let you fine-tune the API calls, by creating a new model with a built-in layer count:

$ ollama show --modelfile llama3.1:70b-instruct-q4_0 > Modelfile
# edit Modelfile and add "PARAMETER num_gpu 46"
$ ollama create llama3.1:70b-instruct-n46-q4_0 -f Modelfile
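
For the API-call variant, a minimal sketch against ollama's /api/generate endpoint (the prompt here is just a placeholder):

$ curl http://localhost:11434/api/generate -d '{
    "model": "llama3:70b-instruct-q4_0",
    "prompt": "Why is the sky blue?",
    "options": {"num_gpu": 46}
  }'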

The other thing you can try is turning on flash attention by adding OLLAMA_FLASH_ATTENTION=1 to the server environment. Flash attention makes better use of the KV cache and so may also reduce memory pressure.
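
Since ollama runs under systemd here, one way to set that variable is a standard drop-in override (nothing ollama-specific beyond the variable name):

$ sudo systemctl edit ollama.service
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
$ sudo systemctl restart ollama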

Having said that, I am also seeing the occasional OOM, so the memory calculations need some TLC.

Author
Owner

@oleid commented on GitHub (Jul 24, 2024):

@rick-github Thanks for the hint!

While OLLAMA_FLASH_ATTENTION=1 doesn't seem to change anything in this case, I can confirm that manually limiting it to 46 layers indeed works around the issue.
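
For interactive testing, the same parameter can also be set per session from ollama's REPL via its /set parameter command:

$ ollama run llama3:70b-instruct-q4_0
>>> /set parameter num_gpu 46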

Author
Owner

@dhiltgen commented on GitHub (Jul 24, 2024):

@oleid which model were you trying to load?

Author
Owner

@rick-github commented on GitHub (Jul 24, 2024):

llama3:70b-instruct-q4_0

Author
Owner

@oleid commented on GitHub (Jul 24, 2024):

@dhiltgen llama3:70b

FWIW: I just tried CognitiveComputations/dolphin-qwen2:72b-v2.9.2-q4_k_s and offloading happens to work fine with it. It reports memory.required.partial="23.7 GiB"; for Llama 3 it is memory.required.partial="23.9 GiB".

[Output of ollama service]
Jul 24 21:50:02 vega ollama[132945]: time=2024-07-24T21:50:02.743+02:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=81 layers.offload=43 layers.split="" memory.available="[24.0 GiB]" memory.required.full="43.1 GiB" memory.required.partial="23.7 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.7 GiB]" memory.weights.total="39.9 GiB" memory.weights.repeating="38.9 GiB" memory.weights.nonrepeating="974.6 MiB" memory.graph.full="313.0 MiB" memory.graph.partial="1.3 GiB"
Jul 24 21:50:02 vega ollama[132945]: time=2024-07-24T21:50:02.743+02:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3633236186/runners/rocm_v60102/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-11fe3cb8956204489b5d545e5c8bae6cf888d110f0fd39bc2afdb6a5644d76be --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 43 --parallel 1 --port 45365"
Jul 24 21:50:02 vega ollama[132945]: time=2024-07-24T21:50:02.743+02:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
Jul 24 21:50:02 vega ollama[132945]: time=2024-07-24T21:50:02.743+02:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
Jul 24 21:50:02 vega ollama[132945]: time=2024-07-24T21:50:02.743+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
Jul 24 21:50:02 vega ollama[174233]: INFO [main] build info | build=1 commit="d94c6e0" tid="139910788932416" timestamp=1721850602
Jul 24 21:50:02 vega ollama[174233]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="139910788932416" timestamp=1721850602 total_threads=16
Jul 24 21:50:02 vega ollama[174233]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="45365" tid="139910788932416" timestamp=1721850602
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: loaded meta data with 24 key-value pairs and 963 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-11fe3cb8956204489b5d545e5c8bae6cf888d110f0fd39bc2afdb6a5644d76be (version GGUF V3 (latest))
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   1:                               general.name str              = qwen2
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   2:                          qwen2.block_count u32              = 80
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   3:                       qwen2.context_length u32              = 131072
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 8192
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 29568
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 64
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 8
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  10:                          general.file_type u32              = 14
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  19:               general.quantization_version u32              = 2
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  20:                      quantize.imatrix.file str              = /models/qwen2-GGUF/qwen2.imatrix
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  21:                   quantize.imatrix.dataset str              = /training_data/calibration_datav3.txt
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  22:             quantize.imatrix.entries_count i32              = 560
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - kv  23:              quantize.imatrix.chunks_count i32              = 128
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - type  f32:  401 tensors
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - type q5_0:   70 tensors
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - type q5_1:   10 tensors
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - type q4_K:  401 tensors
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - type q5_K:   80 tensors
Jul 24 21:50:02 vega ollama[132945]: llama_model_loader: - type q6_K:    1 tensors
Jul 24 21:50:02 vega ollama[132945]: llm_load_vocab: special tokens cache size = 421
Jul 24 21:50:02 vega ollama[132945]: time=2024-07-24T21:50:02.994+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
Jul 24 21:50:03 vega ollama[132945]: llm_load_vocab: token to piece cache size = 0.9352 MB
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: format           = GGUF V3 (latest)
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: arch             = qwen2
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: vocab type       = BPE
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_vocab          = 152064
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_merges         = 151387
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: vocab_only       = 0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_ctx_train      = 131072
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_embd           = 8192
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_layer          = 80
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_head           = 64
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_head_kv        = 8
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_rot            = 128
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_swa            = 0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_embd_head_k    = 128
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_embd_head_v    = 128
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_gqa            = 8
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_embd_k_gqa     = 1024
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_embd_v_gqa     = 1024
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_ff             = 29568
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_expert         = 0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_expert_used    = 0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: causal attn      = 1
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: pooling type     = 0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: rope type        = 2
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: rope scaling     = linear
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: freq_base_train  = 1000000.0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: freq_scale_train = 1
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: rope_finetuned   = unknown
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: ssm_d_conv       = 0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: ssm_d_inner      = 0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: ssm_d_state      = 0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: ssm_dt_rank      = 0
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: model type       = 70B
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: model ftype      = Q4_K - Small
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: model params     = 72.71 B
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: model size       = 40.87 GiB (4.83 BPW)
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: general.name     = qwen2
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: BOS token        = 11 ','
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: LF token         = 148848 'ÄĬ'
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
Jul 24 21:50:03 vega ollama[132945]: llm_load_print_meta: max token length = 256
Jul 24 21:50:04 vega ollama[132945]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Jul 24 21:50:04 vega ollama[132945]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Jul 24 21:50:04 vega ollama[132945]: ggml_cuda_init: found 1 ROCm devices:
Jul 24 21:50:04 vega ollama[132945]:   Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
Jul 24 21:50:04 vega ollama[132945]: llm_load_tensors: ggml ctx size =    0.85 MiB
Jul 24 21:50:05 vega ollama[132945]: time=2024-07-24T21:50:05.702+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server not responding"
Jul 24 21:50:06 vega ollama[132945]: time=2024-07-24T21:50:06.542+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
Jul 24 21:50:06 vega ollama[132945]: llm_load_tensors: offloading 43 repeating layers to GPU
Jul 24 21:50:06 vega ollama[132945]: llm_load_tensors: offloaded 43/81 layers to GPU
Jul 24 21:50:06 vega ollama[132945]: llm_load_tensors:      ROCm0 buffer size = 21533.93 MiB
Jul 24 21:50:06 vega ollama[132945]: llm_load_tensors:        CPU buffer size = 41850.31 MiB
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model: n_ctx      = 2048
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model: n_batch    = 512
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model: n_ubatch   = 512
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model: flash_attn = 0
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model: freq_base  = 1000000.0
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model: freq_scale = 1
Jul 24 21:50:08 vega ollama[132945]: llama_kv_cache_init:      ROCm0 KV buffer size =   344.00 MiB
Jul 24 21:50:08 vega ollama[132945]: llama_kv_cache_init:  ROCm_Host KV buffer size =   296.00 MiB
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model:  ROCm_Host  output buffer size =     0.61 MiB
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model:      ROCm0 compute buffer size =  1287.53 MiB
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model:  ROCm_Host compute buffer size =    20.01 MiB
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model: graph nodes  = 2806
Jul 24 21:50:08 vega ollama[132945]: llama_new_context_with_model: graph splits = 522
Jul 24 21:50:09 vega ollama[174233]: INFO [main] model loaded | tid="139910788932416" timestamp=1721850609
Jul 24 21:50:09 vega ollama[132945]: time=2024-07-24T21:50:09.556+02:00 level=INFO source=server.go:622 msg="llama runner started in 6.81 seconds"
Jul 24 21:51:18 vega ollama[132945]: [GIN] 2024/07/24 - 21:51:18 | 200 |         1m15s |       127.0.0.1 | POST     "/api/generate"
Jul 24 21:53:00 vega ollama[132945]: [GIN] 2024/07/24 - 21:53:00 | 200 |      14.768µs |       127.0.0.1 | HEAD     "/"
Jul 24 21:53:00 vega ollama[132945]: [GIN] 2024/07/24 - 21:53:00 | 200 |    9.223694ms |       127.0.0.1 | POST     "/api/show"
Jul 24 21:53:19 vega ollama[132945]: [GIN] 2024/07/24 - 21:53:19 | 200 | 18.637106356s |       127.0.0.1 | POST     "/api/generate"
Jul 24 21:53:37 vega ollama[132945]: [GIN] 2024/07/24 - 21:53:37 | 200 |    1.310385ms | 2a02:b30:fac:1f00:7285:c2ff:fe0f:69f3 | GET      "/api/tags"
Jul 24 21:53:37 vega ollama[132945]: [GIN] 2024/07/24 - 21:53:37 | 200 |      34.534µs | 2a02:b30:fac:1f00:7285:c2ff:fe0f:69f3 | GET      "/api/version"
Jul 24 21:54:48 vega ollama[132945]: [GIN] 2024/07/24 - 21:54:48 | 200 | 53.933274734s | 2a02:b30:fac:1f00:7285:c2ff:fe0f:69f3 | POST     "/api/chat"
Jul 24 21:54:54 vega ollama[132945]: [GIN] 2024/07/24 - 21:54:54 | 200 |  5.307397265s | 2a02:b30:fac:1f00:7285:c2ff:fe0f:69f3 | POST     "/v1/chat/completions"
Jul 24 21:55:40 vega ollama[132945]: [GIN] 2024/07/24 - 21:55:40 | 200 |      30.527µs | 2a02:b30:fac:1f00:7285:c2ff:fe0f:69f3 | GET      "/api/version"
Jul 24 21:56:42 vega ollama[132945]: [GIN] 2024/07/24 - 21:56:42 | 200 |      13.635µs |       127.0.0.1 | HEAD     "/"
Jul 24 21:56:42 vega ollama[132945]: [GIN] 2024/07/24 - 21:56:42 | 200 |    9.354188ms |       127.0.0.1 | POST     "/api/show"
Jul 24 21:57:43 vega ollama[132945]: [GIN] 2024/07/24 - 21:57:43 | 200 |          1m1s |       127.0.0.1 | POST     "/api/generate"
<!-- gh-comment-id:2248801922 --> @oleid commented on GitHub (Jul 24, 2024):

@dhiltgen FWIW: I just tried `CognitiveComputations/dolphin-qwen2:72b-v2.9.2-q4_k_s`, and offloading appears to work fine here: the scheduler reports `memory.required.partial="23.7 GiB"`, while for [llama3:70b](https://ollama.com/library/llama3:70b) it is `memory.required.partial="23.9 GiB"`. The full ollama service output for this run is the log shown above.
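With the partial-offload estimate landing this close to the card's 24 GiB, one way to probe whether the estimate itself is at fault is to cap the offloaded layer count below what the scheduler picks and see whether the allocation failure goes away. A minimal sketch, assuming a default local ollama instance on port 11434; `num_gpu` is ollama's standard option for the number of GPU layers, and 40 is an arbitrary value chosen to sit below the 43 layers the scheduler picked in the run above, not a tuned number:

```
# Cap offloading at 40 layers instead of the scheduler's estimate,
# leaving extra VRAM headroom on the 24 GiB card.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 40 }
}'
```

The same option can be set interactively with `/set parameter num_gpu 40` inside `ollama run`, and the resulting split can be checked against the scheduler's `offload to rocm` line, e.g. `journalctl -u ollama | grep "offload to rocm"`.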