[GH-ISSUE #7575] Multi-GPU returning garbage #4828

Closed
opened 2026-04-12 15:49:19 -05:00 by GiteaMirror · 15 comments

Originally created by @Escain on GitHub (Nov 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7575

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I recently upgraded a computer with an additional GPU.

When running a model that fits on a single GPU, everything works fine: the command answers and I can see the GPU RAM and usage while it responds.
Before installing the second GPU, I could run models requiring more memory than the available VRAM, and (even if slowly) it worked.
Since I installed the second GPU, models requiring VRAM from both GPUs only output "GGGGGGGGGGGGG" or garbage.

ollama -v
ollama version is 0.4.0
Version 0.3.14 had the same issue.

System:
  Kernel: 6.1.0-26-amd64 arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
  Desktop: KDE Plasma v: 5.27.5 Distro: Debian GNU/Linux 12 (bookworm)
CPU:
  Info: 24-core model: AMD Ryzen Threadripper PRO 7965WX s bits: 64
Memory:
  Total: 377Gi
Graphics:
  Device-1: AMD Navi 31 [Radeon Pro W7900] driver: amdgpu v: 6.3.6
    arch: RDNA-3 bus-ID: e3:00.0
  Device-2: AMD Navi 31 [Radeon RX 7900 XT/7900 XTX] vendor: Gigabyte
    driver: amdgpu v: 6.3.6 arch: RDNA-3 bus-ID: e6:00.0

I tested with several ROCm versions: 6.0.x, 6.1.4, and 6.2.2.

Examples:

granite-code 20b-instruct-8k-q8_0: works fine, executed on the W7900.
nemotron 70b-instruct-q8_0: always answers "GGGGGGGG..."/garbage, executed on both GPUs.
I tested several models: llama3.1 8B and 70B in Q8, etc.

Maybe related to this: https://github.com/ollama/ollama/issues/6356

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.4.0

GiteaMirror added the bug and amd labels 2026-04-12 15:49:19 -05:00

@rick-github commented on GitHub (Nov 8, 2024):

If you add server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) it will provide more context for debugging.
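
For reference, on a default Linux (systemd) install the server log can typically be collected with something like this (a sketch; adjust the unit name if your setup differs):

journalctl -u ollama.service --no-pager > ollama-server.log

Setting OLLAMA_DEBUG=1 on the service before reproducing the issue adds more detail to the log.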

@Escain commented on GitHub (Nov 8, 2024):

17:13:28: Started ollama.service - Ollama Service.
17:13:28: 2024/11/08 17:13:28 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11433 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[*.ultibotics.com http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
17:13:28: time=2024-11-08T17:13:28.459+01:00 level=INFO source=images.go:755 msg="total blobs: 50"
17:13:28: time=2024-11-08T17:13:28.460+01:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
17:13:28: time=2024-11-08T17:13:28.460+01:00 level=INFO source=routes.go:1240 msg="Listening on [::]:11433 (version 0.4.0)"
17:13:28: time=2024-11-08T17:13:28.460+01:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama3814801335/runners
17:13:28: time=2024-11-08T17:13:28.505+01:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v11 cuda_v12 rocm cpu cpu_avx]"
17:13:28: time=2024-11-08T17:13:28.505+01:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
17:13:28: time=2024-11-08T17:13:28.515+01:00 level=INFO source=amd_linux.go:383 msg="amdgpu is supported" gpu=GPU-a98bd33a087b0622 gpu_type=gfx1100
17:13:28: time=2024-11-08T17:13:28.516+01:00 level=INFO source=amd_linux.go:383 msg="amdgpu is supported" gpu=GPU-b32cc23f0db1465d gpu_type=gfx1100
17:13:28: time=2024-11-08T17:13:28.522+01:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a98bd33a087b0622 library=rocm variant="" compute=gfx1100 driver=6.7 name=1002:7448 total="45.0 GiB" available="45.0 GiB"
17:13:28: time=2024-11-08T17:13:28.522+01:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-b32cc23f0db1465d library=rocm variant="" compute=gfx1100 driver=6.7 name=1002:744c total="24.0 GiB" available="23.0 GiB"
17:13:33: [GIN] 2024/11/08 - 17:13:33 | 200 |      40.799µs |       127.0.0.1 | HEAD     "/"
17:13:33: [GIN] 2024/11/08 - 17:13:33 | 200 |   21.200866ms |       127.0.0.1 | POST     "/api/show"
17:13:33: time=2024-11-08T17:13:33.452+01:00 level=INFO source=server.go:105 msg="system memory" total="377.2 GiB" free="367.2 GiB" free_swap="0 B"
17:13:33: time=2024-11-08T17:13:33.452+01:00 level=INFO source=memory.go:343 msg="offload to rocm" layers.requested=-1 layers.model=81 layers.offload=73 layers.split=49,24 memory.available="[45.0 GiB 23.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="74.2 GiB" memory.required.partial="67.1 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[44.3 GiB 22.9 GiB]" memory.weights.total="68.4 GiB" memory.weights.repeating="67.3 GiB" memory.weights.nonrepeating="1.0 GiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
17:13:33: time=2024-11-08T17:13:33.455+01:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama3814801335/runners/rocm/ollama_llama_server --model /home/ollama_models/blobs/sha256-e1394fca2f0d8147f867e4c0bc7d1cddeb122c4d0daf50fd9a874d182a88af85 --ctx-size 2048 --batch-size 512 --embedding --n-gpu-layers 73 --threads 24 --parallel 1 --tensor-split 49,24 --port 42697"
17:13:33: time=2024-11-08T17:13:33.455+01:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
17:13:33: time=2024-11-08T17:13:33.455+01:00 level=INFO source=server.go:567 msg="waiting for llama runner to start responding"
17:13:33: time=2024-11-08T17:13:33.455+01:00 level=INFO source=server.go:601 msg="waiting for server to become available" status="llm server error"
17:13:33: time=2024-11-08T17:13:33.486+01:00 level=INFO source=runner.go:869 msg="starting go runner"
17:13:33: time=2024-11-08T17:13:33.486+01:00 level=INFO source=runner.go:870 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=24
17:13:33: time=2024-11-08T17:13:33.486+01:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:42697"
17:13:33: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /home/ollama_models/blobs/sha256-e1394fca2f0d8147f867e4c0bc7d1cddeb122c4d0daf50fd9a874d182a88af85 (version GGUF V3 (latest))
17:13:33: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
17:13:33: llama_model_loader: - kv   0:                       general.architecture str              = llama
17:13:33: llama_model_loader: - kv   1:                               general.type str              = model
17:13:33: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 70B Instruct
17:13:33: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
17:13:33: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
17:13:33: llama_model_loader: - kv   5:                         general.size_label str              = 70B
17:13:33: llama_model_loader: - kv   6:                            general.license str              = llama3.1
17:13:33: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
17:13:33: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
17:13:33: llama_model_loader: - kv   9:                          llama.block_count u32              = 80
17:13:33: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
17:13:33: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 8192
17:13:33: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 28672
17:13:33: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 64
17:13:33: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
17:13:33: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
17:13:33: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
17:13:33: llama_model_loader: - kv  17:                          general.file_type u32              = 7
17:13:33: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
17:13:33: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
17:13:33: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
17:13:33: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
17:13:33: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
17:13:33: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
17:13:33: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
17:13:33: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
17:13:33: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
17:13:33: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
17:13:33: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
17:13:33: llama_model_loader: - type  f32:  162 tensors
17:13:33: llama_model_loader: - type q8_0:  562 tensors
17:13:33: time=2024-11-08T17:13:33.707+01:00 level=INFO source=server.go:601 msg="waiting for server to become available" status="llm server loading model"
17:13:33: llm_load_vocab: special tokens cache size = 256
17:13:33: llm_load_vocab: token to piece cache size = 0.7999 MB
17:13:33: llm_load_print_meta: format           = GGUF V3 (latest)
17:13:33: llm_load_print_meta: arch             = llama
17:13:33: llm_load_print_meta: vocab type       = BPE
17:13:33: llm_load_print_meta: n_vocab          = 128256
17:13:33: llm_load_print_meta: n_merges         = 280147
17:13:33: llm_load_print_meta: vocab_only       = 0
17:13:33: llm_load_print_meta: n_ctx_train      = 131072
17:13:33: llm_load_print_meta: n_embd           = 8192
17:13:33: llm_load_print_meta: n_layer          = 80
17:13:33: llm_load_print_meta: n_head           = 64
17:13:33: llm_load_print_meta: n_head_kv        = 8
17:13:33: llm_load_print_meta: n_rot            = 128
17:13:33: llm_load_print_meta: n_swa            = 0
17:13:33: llm_load_print_meta: n_embd_head_k    = 128
17:13:33: llm_load_print_meta: n_embd_head_v    = 128
17:13:33: llm_load_print_meta: n_gqa            = 8
17:13:33: llm_load_print_meta: n_embd_k_gqa     = 1024
17:13:33: llm_load_print_meta: n_embd_v_gqa     = 1024
17:13:33: llm_load_print_meta: f_norm_eps       = 0.0e+00
17:13:33: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
17:13:33: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
17:13:33: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
17:13:33: llm_load_print_meta: f_logit_scale    = 0.0e+00
17:13:33: llm_load_print_meta: n_ff             = 28672
17:13:33: llm_load_print_meta: n_expert         = 0
17:13:33: llm_load_print_meta: n_expert_used    = 0
17:13:33: llm_load_print_meta: causal attn      = 1
17:13:33: llm_load_print_meta: pooling type     = 0
17:13:33: llm_load_print_meta: rope type        = 0
17:13:33: llm_load_print_meta: rope scaling     = linear
17:13:33: llm_load_print_meta: freq_base_train  = 500000.0
17:13:33: llm_load_print_meta: freq_scale_train = 1
17:13:33: llm_load_print_meta: n_ctx_orig_yarn  = 131072
17:13:33: llm_load_print_meta: rope_finetuned   = unknown
17:13:33: llm_load_print_meta: ssm_d_conv       = 0
17:13:33: llm_load_print_meta: ssm_d_inner      = 0
17:13:33: llm_load_print_meta: ssm_d_state      = 0
17:13:33: llm_load_print_meta: ssm_dt_rank      = 0
17:13:33: llm_load_print_meta: ssm_dt_b_c_rms   = 0
17:13:33: llm_load_print_meta: model type       = 70B
17:13:33: llm_load_print_meta: model ftype      = Q8_0
17:13:33: llm_load_print_meta: model params     = 70.55 B
17:13:33: llm_load_print_meta: model size       = 69.82 GiB (8.50 BPW)
17:13:33: llm_load_print_meta: general.name     = Meta Llama 3.1 70B Instruct
17:13:33: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
17:13:33: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
17:13:33: llm_load_print_meta: LF token         = 128 'Ä'
17:13:33: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
17:13:33: llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
17:13:33: llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
17:13:33: llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
17:13:33: llm_load_print_meta: max token length = 256
17:13:34: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
17:13:34: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
17:13:34: ggml_cuda_init: found 2 ROCm devices:
17:13:34:   Device 0: AMD Radeon PRO W7900, compute capability 11.0, VMM: no
17:13:34:   Device 1: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
17:13:34: llm_load_tensors: ggml ctx size =    1.02 MiB
17:13:37: llm_load_tensors: offloading 73 repeating layers to GPU
17:13:37: llm_load_tensors: offloaded 73/81 layers to GPU
17:13:37: llm_load_tensors:      ROCm0 buffer size = 42486.07 MiB
17:13:37: llm_load_tensors:      ROCm1 buffer size = 20809.51 MiB
17:13:37: llm_load_tensors:        CPU buffer size = 71494.28 MiB
17:13:43: llama_new_context_with_model: n_ctx      = 2048
17:13:43: llama_new_context_with_model: n_batch    = 512
17:13:43: llama_new_context_with_model: n_ubatch   = 512
17:13:43: llama_new_context_with_model: flash_attn = 0
17:13:43: llama_new_context_with_model: freq_base  = 500000.0
17:13:43: llama_new_context_with_model: freq_scale = 1
17:13:43: llama_kv_cache_init:      ROCm0 KV buffer size =   392.00 MiB
17:13:43: llama_kv_cache_init:      ROCm1 KV buffer size =   192.00 MiB
17:13:43: llama_kv_cache_init:  ROCm_Host KV buffer size =    56.00 MiB
17:13:43: llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
17:13:43: llama_new_context_with_model:  ROCm_Host  output buffer size =     0.52 MiB
17:13:43: llama_new_context_with_model:      ROCm0 compute buffer size =  1331.12 MiB
17:13:43: llama_new_context_with_model:      ROCm1 compute buffer size =   324.00 MiB
17:13:43: llama_new_context_with_model:  ROCm_Host compute buffer size =    20.01 MiB
17:13:43: llama_new_context_with_model: graph nodes  = 2566
17:13:43: llama_new_context_with_model: graph splits = 96
17:13:43: time=2024-11-08T17:13:43.497+01:00 level=INFO source=server.go:606 msg="llama runner started in 10.04 seconds"
17:13:43: [GIN] 2024/11/08 - 17:13:43 | 200 | 10.121013902s |       127.0.0.1 | POST     "/api/generate"
17:13:52: [GIN] 2024/11/08 - 17:13:52 | 200 |  6.535644995s |       127.0.0.1 | POST     "/api/chat"
17:13:59: Stopping ollama.service - Ollama Service...
17:14:01: ollama.service: Deactivated successfully.
17:14:01: Stopped ollama.service - Ollama Service.
17:14:01: ollama.service: Consumed 1min 8.238s CPU time.

@rick-github commented on GitHub (Nov 8, 2024):

Does it work better if you reduce the number of layers offloaded? It's currently offloading 73; what happens if you lower it to, say, 65 (see https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650, replacing 0 with 65)?

@Escain commented on GitHub (Nov 8, 2024):

Do you mean this:

/set parameter num_gpu 65

When I do this, the full model runs on CPU and RAM. So "yes", it does sort of work, but the GPUs are not used at all (no VRAM is allocated, there is no GPU activity, and it's very slow).

@rick-github commented on GitHub (Nov 8, 2024):

It should load some of the model in the GPU. Can you post the logs for the model loaded with num_gpu=65?

@Escain commented on GitHub (Nov 8, 2024):

OK, it seems the parameter was not applied in my first attempt.

>>> hi
 said ~ages.^^. given might ' Japan japan ages.Hand gave^C

>>> /set parameter num_gpu 65
Set parameter 'num_gpu' to '65'
>>> hi
:­FormattedMessage were '^FormattedMessage^. after video case  indictment . . Tinder Tinderistan tindershapesSTALLstshapeshanENDOR and vide shapedellig Stefan item Japanese a 
shapes  boddated video Welch..
 ' Japan ' www game Trinity shaped dated.
 titled worship dated^C

Now I can see this in the log for the last chat:

Nov 08 17:59:56 Amain ollama[115026]: llm_load_tensors: offloading 65 repeating layers to GPU
Nov 08 17:59:56 Amain ollama[115026]: llm_load_tensors: offloaded 65/81 layers to GPU

Then I tested several values:

num_gpu=40 # works
num_gpu=45 # works
num_gpu=50 # garbage

Between 45 and 50 is the limit where the model requires over 48GB of VRAM and starts using the second GPU.
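
For reference, this kind of sweep can also be scripted against the API instead of the interactive /set command (a rough sketch; it assumes curl, the non-default port 11433 from the log above, and that the nemotron tag below matches what is actually installed):

# try a few num_gpu values and eyeball the responses
for n in 40 45 50 55; do
  echo "=== num_gpu=$n ==="
  curl -s http://localhost:11433/api/generate -d "{
    \"model\": \"nemotron:70b-instruct-q8_0\",
    \"prompt\": \"Say hello in one short sentence.\",
    \"stream\": false,
    \"options\": {\"num_gpu\": $n}
  }" | grep -o '"response":"[^"]*"'
done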

@dhiltgen commented on GitHub (Nov 8, 2024):

Memory predictions are model-architecture specific, and we have some miscalculations on a few models. Another workaround you can use until we fix the prediction is setting OLLAMA_GPU_OVERHEAD to reserve some VRAM on each GPU, so that fewer layers are loaded.
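
A minimal sketch of that workaround on a systemd install (assuming a 2 GiB reservation per GPU; the value is in bytes):

sudo systemctl edit ollama.service
# add under [Service]:
#   Environment="OLLAMA_GPU_OVERHEAD=2147483648"
sudo systemctl restart ollama.service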

@Escain commented on GitHub (Nov 8, 2024):

No, this happens on ALL the models that I tested: nemotron, llama3.1, mixtral123b, reflection, etc.
I have seen NOT A SINGLE model work when split across both GPUs,
while all of them work when using only one of the GPUs.

@dhiltgen commented on GitHub (Nov 8, 2024):

Sorry I misunderstood. There might be more here than just memory predictions being incorrect.

Multi-GPU with AMD has some challenges we're still working on - #7378 might be relevant here. Take a look at the issues linked from that PR and there might be some insight to help you work through possible causes. Adjusting BIOS settings might help depending on what the root cause is.

@rick-github commented on GitHub (Nov 8, 2024):

Is it possible that the second GPU has an issue? What happens if you use ROCR_VISIBLE_DEVICES to force a model to load only on the second GPU?
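
As a rough sketch of that test (assuming the 7900 XTX is device index 1; the GPU UUIDs from the server log can be used instead of the index):

sudo systemctl edit ollama.service
# under [Service]:
#   Environment="ROCR_VISIBLE_DEVICES=1"
sudo systemctl restart ollama.service
ollama run llama3.1:8b "hello"   # small enough to fit entirely on the 24 GB card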

@Escain commented on GitHub (Nov 8, 2024):

I found several dmesg errors like the following:

... # many of the same
[17679.900951] amdgpu: init_user_pages: Failed to get user pages: -1
[17679.921145] amdgpu: init_user_pages: Failed to get user pages: -1
[17683.730637] amdgpu 0000:e3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0x11d90028580 flags=0x0020]
[17683.730656] amdgpu 0000:e3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0x11d90021980 flags=0x0020]
... # many of the same.

I believe this could be a good hint pointing at the ROCm driver.

> Is it possible that the second GPU has an issue? What happens if you use ROCR_VISIBLE_DEVICES to force a model to load only on the second GPU?

Both GPUs seem to work properly; I can run models on each of them individually without issues.

> #7378

Using OLLAMA_NO_PEER_COPY=0 or OLLAMA_NO_PEER_COPY=1 seems to have no impact.

@dhiltgen commented on GitHub (Nov 8, 2024):

> > #7378
>
> Using OLLAMA_NO_PEER_COPY=0 or OLLAMA_NO_PEER_COPY=1 seems to have no impact.

That PR isn't merged yet, so you would have to build from source from my branch for this env var to be supported. We're not positive yet if this is the strategy we're going to take to resolve this, but if you do try, and it resolves the problem for you, that's a good data point to advocate for that PR getting merged.

https://github.com/ollama/ollama/blob/main/docs/development.md

@joe2gaan commented on GitHub (Nov 11, 2024):

Adding the Linux kernel parameters below worked for me on my 5x AMD Instinct MI60 setup.

Recommended Kernel Parameters for Intel Xeon with AMD GPU
IOMMU Settings:
intel_iommu=on – Enables IOMMU for Intel CPUs, allowing the kernel to manage memory for the GPU.
iommu=pt – Sets IOMMU to pass-through mode, ideal for GPUs needing direct memory access.

Memory and Page Table Permissions:
iommu.strict=1 – Enables strict memory allocation, often helpful for high-memory usage setups like GPU workloads.
iommu.passthrough=1 – Allows devices, such as GPUs, to access assigned memory regions directly.

Page Table Handling:
iommu.force_aperture=1 – Forces IOMMU to use a single memory region, beneficial for stability with large memory regions.
iommu=fullflush – Ensures IOMMU performs full cache flushes, reducing potential memory conflicts.

PCIe and GPU-Specific Optimizations:
pcie_aspm=off – Disables Active State Power Management, preventing potential latency issues with high-performance GPUs.
amdgpu.vm_update_mode=3 – Improves virtual memory updates with amdgpu when handling multiple GPUs.

Diagnostics and Fault Tolerance:
loglevel=7 – Enables verbose kernel logging, which is useful for detailed fault diagnostics.
iommu=soft – As a fallback, enables software IOMMU if hardware IOMMU has issues.

I hope this helps someone.

@Escain commented on GitHub (Nov 19, 2024):

That is correct.
I applied:

$sudo nano /etc/default/grub

# Append the IOMMU parameters at the end of GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt iommu.strict=1 iommu.passthrough=1 amdgpu.vm_update_mode=3"

$sudo update-grub
$sudo reboot

Now I can run 70B Q8 models over the two GPUs.

I'm closing this ticket as it is not strictly related to Ollama.

@joe2gaan commented on GitHub (Nov 19, 2024):

Outstanding!

Reference: github-starred/ollama#4828