[GH-ISSUE #7575] Multi-GPU returning garbage #4828

Closed
opened 2026-04-12 15:49:19 -05:00 by GiteaMirror · 15 comments

Originally created by @Escain on GitHub (Nov 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7575

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I recently upgraded a computer with an additional GPU.

When running a model that fits on a single GPU, everything works fine: the command answers and I can see the GPU RAM and usage while it responds.
Before installing the second GPU, I could run models requiring more memory than the available VRAM, and (even if slowly) it worked.
Since I installed the second GPU, models requiring VRAM from both GPUs only output "GGGGGGGGGGGGG" or garbage.

ollama -v
ollama version is 0.4.0
Version 0.3.14 had the same issue.

System:
  Kernel: 6.1.0-26-amd64 arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
  Desktop: KDE Plasma v: 5.27.5 Distro: Debian GNU/Linux 12 (bookworm)
CPU:
  Info: 24-core model: AMD Ryzen Threadripper PRO 7965WX s bits: 64
Memory:
  Total: 377Gi
Graphics:
  Device-1: AMD Navi 31 [Radeon Pro W7900] driver: amdgpu v: 6.3.6
    arch: RDNA-3 bus-ID: e3:00.0
  Device-2: AMD Navi 31 [Radeon RX 7900 XT/7900 XTX] vendor: Gigabyte
    driver: amdgpu v: 6.3.6 arch: RDNA-3 bus-ID: e6:00.0

I tested with several ROCm versions: 6.0.x, 6.1.4, and 6.2.2.

Examples:

granite-code 20b-instruct-8k-q8_0: works fine, executed on the W7900.
nemotron 70b-instruct-q8_0: always answers "GGGGGGGG..."/garbage, executed on both GPUs.
I tested several models: llama3.1 8B and 70B in Q8, etc.

Maybe related to this: https://github.com/ollama/ollama/issues/6356

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.4.0

GiteaMirror added the bug and amd labels 2026-04-12 15:49:19 -05:00

@rick-github commented on GitHub (Nov 8, 2024):

If you add server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) it will provide more context for debugging.
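
For reference, on a default Linux (systemd) install the server log can typically be collected with something like this (a sketch; adjust the unit name if your setup differs):

journalctl -u ollama.service --no-pager > ollama-server.log

Setting OLLAMA_DEBUG=1 on the service before reproducing the issue adds more detail to the log.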

@Escain commented on GitHub (Nov 8, 2024):

17:13:28: Started ollama.service - Ollama Service.
17:13:28: 2024/11/08 17:13:28 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11433 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[*.ultibotics.com http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
17:13:28: time=2024-11-08T17:13:28.459+01:00 level=INFO source=images.go:755 msg="total blobs: 50"
17:13:28: time=2024-11-08T17:13:28.460+01:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
17:13:28: time=2024-11-08T17:13:28.460+01:00 level=INFO source=routes.go:1240 msg="Listening on [::]:11433 (version 0.4.0)"
17:13:28: time=2024-11-08T17:13:28.460+01:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama3814801335/runners
17:13:28: time=2024-11-08T17:13:28.505+01:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v11 cuda_v12 rocm cpu cpu_avx]"
17:13:28: time=2024-11-08T17:13:28.505+01:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
17:13:28: time=2024-11-08T17:13:28.515+01:00 level=INFO source=amd_linux.go:383 msg="amdgpu is supported" gpu=GPU-a98bd33a087b0622 gpu_type=gfx1100
17:13:28: time=2024-11-08T17:13:28.516+01:00 level=INFO source=amd_linux.go:383 msg="amdgpu is supported" gpu=GPU-b32cc23f0db1465d gpu_type=gfx1100
17:13:28: time=2024-11-08T17:13:28.522+01:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a98bd33a087b0622 library=rocm variant="" compute=gfx1100 driver=6.7 name=1002:7448 total="45.0 GiB" available="45.0 GiB"
17:13:28: time=2024-11-08T17:13:28.522+01:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-b32cc23f0db1465d library=rocm variant="" compute=gfx1100 driver=6.7 name=1002:744c total="24.0 GiB" available="23.0 GiB"
17:13:33: [GIN] 2024/11/08 - 17:13:33 | 200 |      40.799µs |       127.0.0.1 | HEAD     "/"
17:13:33: [GIN] 2024/11/08 - 17:13:33 | 200 |   21.200866ms |       127.0.0.1 | POST     "/api/show"
17:13:33: time=2024-11-08T17:13:33.452+01:00 level=INFO source=server.go:105 msg="system memory" total="377.2 GiB" free="367.2 GiB" free_swap="0 B"
17:13:33: time=2024-11-08T17:13:33.452+01:00 level=INFO source=memory.go:343 msg="offload to rocm" layers.requested=-1 layers.model=81 layers.offload=73 layers.split=49,24 memory.available="[45.0 GiB 23.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="74.2 GiB" memory.required.partial="67.1 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[44.3 GiB 22.9 GiB]" memory.weights.total="68.4 GiB" memory.weights.repeating="67.3 GiB" memory.weights.nonrepeating="1.0 GiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
17:13:33: time=2024-11-08T17:13:33.455+01:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama3814801335/runners/rocm/ollama_llama_server --model /home/ollama_models/blobs/sha256-e1394fca2f0d8147f867e4c0bc7d1cddeb122c4d0daf50fd9a874d182a88af85 --ctx-size 2048 --batch-size 512 --embedding --n-gpu-layers 73 --threads 24 --parallel 1 --tensor-split 49,24 --port 42697"
17:13:33: time=2024-11-08T17:13:33.455+01:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
17:13:33: time=2024-11-08T17:13:33.455+01:00 level=INFO source=server.go:567 msg="waiting for llama runner to start responding"
17:13:33: time=2024-11-08T17:13:33.455+01:00 level=INFO source=server.go:601 msg="waiting for server to become available" status="llm server error"
17:13:33: time=2024-11-08T17:13:33.486+01:00 level=INFO source=runner.go:869 msg="starting go runner"
17:13:33: time=2024-11-08T17:13:33.486+01:00 level=INFO source=runner.go:870 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=24
17:13:33: time=2024-11-08T17:13:33.486+01:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:42697"
17:13:33: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /home/ollama_models/blobs/sha256-e1394fca2f0d8147f867e4c0bc7d1cddeb122c4d0daf50fd9a874d182a88af85 (version GGUF V3 (latest))
17:13:33: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
17:13:33: llama_model_loader: - kv   0:                       general.architecture str              = llama
17:13:33: llama_model_loader: - kv   1:                               general.type str              = model
17:13:33: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 70B Instruct
17:13:33: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
17:13:33: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
17:13:33: llama_model_loader: - kv   5:                         general.size_label str              = 70B
17:13:33: llama_model_loader: - kv   6:                            general.license str              = llama3.1
17:13:33: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
17:13:33: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
17:13:33: llama_model_loader: - kv   9:                          llama.block_count u32              = 80
17:13:33: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
17:13:33: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 8192
17:13:33: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 28672
17:13:33: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 64
17:13:33: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
17:13:33: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
17:13:33: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
17:13:33: llama_model_loader: - kv  17:                          general.file_type u32              = 7
17:13:33: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
17:13:33: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
17:13:33: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
17:13:33: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
17:13:33: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
17:13:33: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
17:13:33: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
17:13:33: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
17:13:33: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
17:13:33: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
17:13:33: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
17:13:33: llama_model_loader: - type  f32:  162 tensors
17:13:33: llama_model_loader: - type q8_0:  562 tensors
17:13:33: time=2024-11-08T17:13:33.707+01:00 level=INFO source=server.go:601 msg="waiting for server to become available" status="llm server loading model"
17:13:33: llm_load_vocab: special tokens cache size = 256
17:13:33: llm_load_vocab: token to piece cache size = 0.7999 MB
17:13:33: llm_load_print_meta: format           = GGUF V3 (latest)
17:13:33: llm_load_print_meta: arch             = llama
17:13:33: llm_load_print_meta: vocab type       = BPE
17:13:33: llm_load_print_meta: n_vocab          = 128256
17:13:33: llm_load_print_meta: n_merges         = 280147
17:13:33: llm_load_print_meta: vocab_only       = 0
17:13:33: llm_load_print_meta: n_ctx_train      = 131072
17:13:33: llm_load_print_meta: n_embd           = 8192
17:13:33: llm_load_print_meta: n_layer          = 80
17:13:33: llm_load_print_meta: n_head           = 64
17:13:33: llm_load_print_meta: n_head_kv        = 8
17:13:33: llm_load_print_meta: n_rot            = 128
17:13:33: llm_load_print_meta: n_swa            = 0
17:13:33: llm_load_print_meta: n_embd_head_k    = 128
17:13:33: llm_load_print_meta: n_embd_head_v    = 128
17:13:33: llm_load_print_meta: n_gqa            = 8
17:13:33: llm_load_print_meta: n_embd_k_gqa     = 1024
17:13:33: llm_load_print_meta: n_embd_v_gqa     = 1024
17:13:33: llm_load_print_meta: f_norm_eps       = 0.0e+00
17:13:33: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
17:13:33: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
17:13:33: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
17:13:33: llm_load_print_meta: f_logit_scale    = 0.0e+00
17:13:33: llm_load_print_meta: n_ff             = 28672
17:13:33: llm_load_print_meta: n_expert         = 0
17:13:33: llm_load_print_meta: n_expert_used    = 0
17:13:33: llm_load_print_meta: causal attn      = 1
17:13:33: llm_load_print_meta: pooling type     = 0
17:13:33: llm_load_print_meta: rope type        = 0
17:13:33: llm_load_print_meta: rope scaling     = linear
17:13:33: llm_load_print_meta: freq_base_train  = 500000.0
17:13:33: llm_load_print_meta: freq_scale_train = 1
17:13:33: llm_load_print_meta: n_ctx_orig_yarn  = 131072
17:13:33: llm_load_print_meta: rope_finetuned   = unknown
17:13:33: llm_load_print_meta: ssm_d_conv       = 0
17:13:33: llm_load_print_meta: ssm_d_inner      = 0
17:13:33: llm_load_print_meta: ssm_d_state      = 0
17:13:33: llm_load_print_meta: ssm_dt_rank      = 0
17:13:33: llm_load_print_meta: ssm_dt_b_c_rms   = 0
17:13:33: llm_load_print_meta: model type       = 70B
17:13:33: llm_load_print_meta: model ftype      = Q8_0
17:13:33: llm_load_print_meta: model params     = 70.55 B
17:13:33: llm_load_print_meta: model size       = 69.82 GiB (8.50 BPW)
17:13:33: llm_load_print_meta: general.name     = Meta Llama 3.1 70B Instruct
17:13:33: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
17:13:33: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
17:13:33: llm_load_print_meta: LF token         = 128 'Ä'
17:13:33: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
17:13:33: llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
17:13:33: llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
17:13:33: llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
17:13:33: llm_load_print_meta: max token length = 256
17:13:34: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
17:13:34: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
17:13:34: ggml_cuda_init: found 2 ROCm devices:
17:13:34:   Device 0: AMD Radeon PRO W7900, compute capability 11.0, VMM: no
17:13:34:   Device 1: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
17:13:34: llm_load_tensors: ggml ctx size =    1.02 MiB
17:13:37: llm_load_tensors: offloading 73 repeating layers to GPU
17:13:37: llm_load_tensors: offloaded 73/81 layers to GPU
17:13:37: llm_load_tensors:      ROCm0 buffer size = 42486.07 MiB
17:13:37: llm_load_tensors:      ROCm1 buffer size = 20809.51 MiB
17:13:37: llm_load_tensors:        CPU buffer size = 71494.28 MiB
17:13:43: llama_new_context_with_model: n_ctx      = 2048
17:13:43: llama_new_context_with_model: n_batch    = 512
17:13:43: llama_new_context_with_model: n_ubatch   = 512
17:13:43: llama_new_context_with_model: flash_attn = 0
17:13:43: llama_new_context_with_model: freq_base  = 500000.0
17:13:43: llama_new_context_with_model: freq_scale = 1
17:13:43: llama_kv_cache_init:      ROCm0 KV buffer size =   392.00 MiB
17:13:43: llama_kv_cache_init:      ROCm1 KV buffer size =   192.00 MiB
17:13:43: llama_kv_cache_init:  ROCm_Host KV buffer size =    56.00 MiB
17:13:43: llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
17:13:43: llama_new_context_with_model:  ROCm_Host  output buffer size =     0.52 MiB
17:13:43: llama_new_context_with_model:      ROCm0 compute buffer size =  1331.12 MiB
17:13:43: llama_new_context_with_model:      ROCm1 compute buffer size =   324.00 MiB
17:13:43: llama_new_context_with_model:  ROCm_Host compute buffer size =    20.01 MiB
17:13:43: llama_new_context_with_model: graph nodes  = 2566
17:13:43: llama_new_context_with_model: graph splits = 96
17:13:43: time=2024-11-08T17:13:43.497+01:00 level=INFO source=server.go:606 msg="llama runner started in 10.04 seconds"
17:13:43: [GIN] 2024/11/08 - 17:13:43 | 200 | 10.121013902s |       127.0.0.1 | POST     "/api/generate"
17:13:52: [GIN] 2024/11/08 - 17:13:52 | 200 |  6.535644995s |       127.0.0.1 | POST     "/api/chat"
17:13:59: Stopping ollama.service - Ollama Service...
17:14:01: ollama.service: Deactivated successfully.
17:14:01: Stopped ollama.service - Ollama Service.
17:14:01: ollama.service: Consumed 1min 8.238s CPU time.

@rick-github commented on GitHub (Nov 8, 2024):

Does it work better if you reduce the number of layers offloaded? It's currently offloading 73; what happens if you lower it to, say, 65 (see https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650, replacing 0 with 65)?

@Escain commented on GitHub (Nov 8, 2024):

Do you mean this:

/set parameter num_gpu 65

When I do this, the full model runs on CPU and RAM. So "yes", it does sort of work, but the GPUs are not used at all (no VRAM is allocated, there is no GPU activity, and it's very slow).

@rick-github commented on GitHub (Nov 8, 2024):

It should load some of the model in the GPU. Can you post the logs for the model loaded with num_gpu=65?

@Escain commented on GitHub (Nov 8, 2024):

OK, it seems the parameter was not applied in my first attempt.

>>> hi
 said ~ages.^^. given might ' Japan japan ages.Hand gave^C

>>> /set parameter num_gpu 65
Set parameter 'num_gpu' to '65'
>>> hi
:­FormattedMessage were '^FormattedMessage^. after video case  indictment . . Tinder Tinderistan tindershapesSTALLstshapeshanENDOR and vide shapedellig Stefan item Japanese a 
shapes  boddated video Welch..
 ' Japan ' www game Trinity shaped dated.
 titled worship dated^C

Now I can see this in the log for the last chat:

Nov 08 17:59:56 Amain ollama[115026]: llm_load_tensors: offloading 65 repeating layers to GPU
Nov 08 17:59:56 Amain ollama[115026]: llm_load_tensors: offloaded 65/81 layers to GPU

Then I tested several values:

num_gpu=40 # works
num_gpu=45 # works
num_gpu=50 # garbage

Between 45 and 50 is the limit where the model requires over 48GB of VRAM and starts using the second GPU.
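
For reference, this kind of sweep can also be scripted against the API instead of the interactive /set command (a rough sketch; it assumes curl, the non-default port 11433 from the log above, and that the nemotron tag below matches what is actually installed):

# try a few num_gpu values and eyeball the responses
for n in 40 45 50 55; do
  echo "=== num_gpu=$n ==="
  curl -s http://localhost:11433/api/generate -d "{
    \"model\": \"nemotron:70b-instruct-q8_0\",
    \"prompt\": \"Say hello in one short sentence.\",
    \"stream\": false,
    \"options\": {\"num_gpu\": $n}
  }" | grep -o '"response":"[^"]*"'
done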

@dhiltgen commented on GitHub (Nov 8, 2024):

Memory predictions are model-architecture specific, and we have some miscalculations on a few models. Another workaround you can use until we fix the prediction is setting OLLAMA_GPU_OVERHEAD to reserve some VRAM on each GPU, so that fewer layers are loaded.
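
A minimal sketch of that workaround on a systemd install (assuming a 2 GiB reservation per GPU; the value is in bytes):

sudo systemctl edit ollama.service
# add under [Service]:
#   Environment="OLLAMA_GPU_OVERHEAD=2147483648"
sudo systemctl restart ollama.service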

@Escain commented on GitHub (Nov 8, 2024):

No, this happens on ALL the models that I tested: nemotron, llama3.1, mixtral123b, reflection, etc.
I have seen NOT A SINGLE model work when split across both GPUs,
while all of them work when using only one of the GPUs.

@dhiltgen commented on GitHub (Nov 8, 2024):

Sorry I misunderstood. There might be more here than just memory predictions being incorrect.

Multi-GPU with AMD has some challenges we're still working on - #7378 might be relevant here. Take a look at the issues linked from that PR and there might be some insight to help you work through possible causes. Adjusting BIOS settings might help depending on what the root cause is.

@rick-github commented on GitHub (Nov 8, 2024):

Is it possible that the second GPU has an issue? What happens if you use ROCR_VISIBLE_DEVICES to force a model to load only on the second GPU?
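
As a rough sketch of that test (assuming the 7900 XTX is device index 1; the GPU UUIDs from the server log can be used instead of the index):

sudo systemctl edit ollama.service
# under [Service]:
#   Environment="ROCR_VISIBLE_DEVICES=1"
sudo systemctl restart ollama.service
ollama run llama3.1:8b "hello"   # small enough to fit entirely on the 24 GB card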

@Escain commented on GitHub (Nov 8, 2024):

I found several dmesg errors like the following:

... # many of the same
[17679.900951] amdgpu: init_user_pages: Failed to get user pages: -1
[17679.921145] amdgpu: init_user_pages: Failed to get user pages: -1
[17683.730637] amdgpu 0000:e3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0x11d90028580 flags=0x0020]
[17683.730656] amdgpu 0000:e3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0x11d90021980 flags=0x0020]
... # many of the same.

I believe this could be a good hint pointing at the ROCm driver.

> Is it possible that the second GPU has an issue? What happens if you use ROCR_VISIBLE_DEVICES to force a model to load only on the second GPU?

Both GPUs seem to work properly; I can run models on each of them individually without issues.

> #7378

Using OLLAMA_NO_PEER_COPY=0 or OLLAMA_NO_PEER_COPY=1 seems to have no impact.

@dhiltgen commented on GitHub (Nov 8, 2024):

> > #7378
>
> Using OLLAMA_NO_PEER_COPY=0 or OLLAMA_NO_PEER_COPY=1 seems to have no impact.

That PR isn't merged yet, so you would have to build from source from my branch for this env var to be supported. We're not positive yet if this is the strategy we're going to take to resolve this, but if you do try, and it resolves the problem for you, that's a good data point to advocate for that PR getting merged.

https://github.com/ollama/ollama/blob/main/docs/development.md

@joe2gaan commented on GitHub (Nov 11, 2024):

Adding the Linux kernel parameters below worked for me on my 5x AMD Instinct MI60 setup.

Recommended Kernel Parameters for Intel Xeon with AMD GPU
IOMMU Settings:
intel_iommu=on – Enables IOMMU for Intel CPUs, allowing the kernel to manage memory for the GPU.
iommu=pt – Sets IOMMU to pass-through mode, ideal for GPUs needing direct memory access.

Memory and Page Table Permissions:
iommu.strict=1 – Enables strict memory allocation, often helpful for high-memory usage setups like GPU workloads.
iommu.passthrough=1 – Allows devices, such as GPUs, to access assigned memory regions directly.

Page Table Handling:
iommu.force_aperture=1 – Forces IOMMU to use a single memory region, beneficial for stability with large memory regions.
iommu=fullflush – Ensures IOMMU performs full cache flushes, reducing potential memory conflicts.

PCIe and GPU-Specific Optimizations:
pcie_aspm=off – Disables Active State Power Management, preventing potential latency issues with high-performance GPUs.
amdgpu.vm_update_mode=3 – Improves virtual memory updates with amdgpu when handling multiple GPUs.

Diagnostics and Fault Tolerance:
loglevel=7 – Enables verbose kernel logging, which is useful for detailed fault diagnostics.
iommu=soft – As a fallback, enables software IOMMU if hardware IOMMU has issues.

I hope this helps someone.

@Escain commented on GitHub (Nov 19, 2024):

That is correct.
I applied:

$sudo nano /etc/default/grub

# Append the IOMMU parameters at the end of GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt iommu.strict=1 iommu.passthrough=1 amdgpu.vm_update_mode=3"

$sudo update-grub
$sudo reboot

Now I can run 70B Q8 models over the two GPUs.

I'm closing this ticket as it is not strictly related to Ollama.

@joe2gaan commented on GitHub (Nov 19, 2024):

Outstanding!

Reference: github-starred/ollama#4828