[GH-ISSUE #2637] Integrated AMD GPU support #1559

Closed
opened 2026-04-12 11:28:07 -05:00 by GiteaMirror · 171 comments

Originally created by @DocMAX on GitHub (Feb 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2637

Originally assigned to: @dhiltgen on GitHub.

Opening a new issue (see https://github.com/ollama/ollama/pull/2195) to track support for integrated GPUs. I have an AMD 5800U CPU with integrated graphics. As far as I can tell from my research, ROCR has lately added support for integrated graphics too.

Currently Ollama seems to ignore iGPUs in general.

GiteaMirror added the feature request, amd labels 2026-04-12 11:28:07 -05:00

@GZGavinZhao commented on GitHub (Feb 22, 2024):

ROCm's support for integrated GPUs is not that good. This issue may largely depend on AMD's progress on improving ROCm.

@DocMAX commented on GitHub (Feb 22, 2024):

OK, but I would like to have an option to enable it, just to check if it works.

@DocMAX commented on GitHub (Feb 22, 2024):

This is what I get with the new Docker image (ROCm support). It detects the Radeon and then says no GPU detected?!?

![image](https://github.com/ollama/ollama/assets/5351323/f2fc1aae-f8fa-415f-a6ba-fa6e1d3b662f)

![image](https://github.com/ollama/ollama/assets/5351323/3bfaf432-d5a9-4c07-85b3-858614a7f161)

@GZGavinZhao commented on GitHub (Feb 22, 2024):

Their `AMDDetected()` function is a bit broken and I haven't figured out a fix for it.

@sid-cypher commented on GitHub (Feb 23, 2024):

I've seen this behavior in #2411, but only with the version from ollama.com.
Try it with the latest released binary?
https://github.com/ollama/ollama/releases/tag/v0.1.27

@GZGavinZhao commented on GitHub (Feb 23, 2024):

Yes, the latest release fixed this behavior.

@DocMAX commented on GitHub (Feb 23, 2024):

I had a permission issue with lxc/docker. Now:

```
time=2024-02-23T19:27:29.715Z level=INFO source=images.go:710 msg="total blobs: 31"
time=2024-02-23T19:27:29.716Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-23T19:27:29.717Z level=INFO source=routes.go:1019 msg="Listening on [::]:11434 (version 0.1.27)"
time=2024-02-23T19:27:29.717Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-23T19:27:33.385Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx rocm_v6 rocm_v5 cuda_v11 cpu_avx2]"
time=2024-02-23T19:27:33.385Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-23T19:27:33.385Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-23T19:27:33.387Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-23T19:27:33.387Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-23T19:27:33.388Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/opt/rocm/lib/librocm_smi64.so.5.0.50701 /opt/rocm-5.7.1/lib/librocm_smi64.so.5.0.50701]"
time=2024-02-23T19:27:33.391Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-23T19:27:33.391Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-23T19:27:33.391Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-23T19:27:33.392Z level=INFO source=routes.go:1042 msg="no GPU detected"
```

So as the topic says, please add integrated GPU support (AMD 5800U here)

@robertvazan commented on GitHub (Feb 24, 2024):

Latest (0.1.27) docker image with ROCm works for me on Ryzen 5600G with 8GB VRAM allocation. Prompt processing is 2x faster than with CPU. Generation runs at max speed even if CPU is busy running other processes. I am on Fedora 39.

Container setup:

- HSA_OVERRIDE_GFX_VERSION=9.0.0
- ~~HCC_AMDGPU_TARGETS=gfx900~~ (unnecessary)
- share devices: ~~/dev/dri/card1, /dev/dri/renderD128~~, /dev/dri, /dev/kfd
- ~~additional options: `--group-add video --security-opt seccomp:unconfined`~~ (unnecessary)

It's however still shaky:

- With topk1, output should be fully reproducible, but first iGPU generation differs from the following ones for the same prompt. Both first and following iGPU generations differ from what CPU produces. Differences are minor though.
- Output is sometimes garbage on iGPU as if the prompt is ignored. Restarting ollama fixes the problem.
- Ollama often fails to offload all layers to the iGPU when switching models, reporting low VRAM as if parts of the previous model are still in VRAM. Restarting ollama fixes the problem for a while.
- Partial offload with 13B model works, but mixtral is broken. It just hangs.
@robertvazan commented on GitHub (Feb 24, 2024):

See also discussion in the #738 epic.

@DocMAX commented on GitHub (Feb 24, 2024):

Why does it work for you??
Still not working here.

```
services:
  ollama:
    #image: ollama/ollama:latest
    image: ollama/ollama:0.1.27-rocm
    container_name: ollama
    volumes:
      - data:/root/.ollama
    restart: unless-stopped
    devices:
      - /dev/dri
      - /dev/kfd
    security_opt:
      - "seccomp:unconfined"
    group_add:
      - video
    environment:
      - 'HSA_OVERRIDE_GFX_VERSION=9.0.0'
      - 'HCC_AMDGPU_TARGETS=gfx900'
```

```
time=2024-02-24T10:16:09.280Z level=INFO source=images.go:710 msg="total blobs: 31"
time=2024-02-24T10:16:09.284Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-24T10:16:09.285Z level=INFO source=routes.go:1019 msg="Listening on [::]:11434 (version 0.1.27)"
time=2024-02-24T10:16:09.285Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-24T10:16:12.184Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v5 rocm_v6 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-02-24T10:16:12.184Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-24T10:16:12.184Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-24T10:16:12.188Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-24T10:16:12.188Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-24T10:16:12.189Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/opt/rocm/lib/librocm_smi64.so.5.0.50701 /opt/rocm-5.7.1/lib/librocm_smi64.so.5.0.50701]"
time=2024-02-24T10:16:12.191Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-24T10:16:12.191Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-24T10:16:12.192Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-24T10:16:12.192Z level=INFO source=routes.go:1042 msg="no GPU detected"
```

Also the non-Docker version doesn't work...

```
root@ollama:~# HCC_AMDGPU_TARGETS=gfx900 HSA_OVERRIDE_GFX_VERSION=9.0.0 LD_LIBRARY_PATH=/usr/lib ollama serve
time=2024-02-24T10:40:14.582Z level=INFO source=images.go:710 msg="total blobs: 0"
time=2024-02-24T10:40:14.582Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-24T10:40:14.583Z level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-02-24T10:40:14.583Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-24T10:40:17.691Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 rocm_v6 cuda_v11 rocm_v5 cpu]"
time=2024-02-24T10:40:17.691Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-24T10:40:17.691Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-24T10:40:17.692Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-24T10:40:17.692Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-24T10:40:17.693Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/usr/lib/librocm_smi64.so.1.0]"
time=2024-02-24T10:40:17.696Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-24T10:40:17.696Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-24T10:40:17.696Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-24T10:40:17.696Z level=INFO source=routes.go:1042 msg="no GPU detected"
```

```
root@ollama:~# rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 7 5800H with Radeon Graphics
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 7 5800H with Radeon Graphics
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4463
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            16
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65216764(0x3e320fc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65216764(0x3e320fc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65216764(0x3e320fc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx90c
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      1024(0x400) KB
  Chip ID:                 5688(0x1638)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   1536
  Internal Node ID:        1
  Compute Unit:            8
  SIMDs per CU:            4
  Shader Engines:          1
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
```

@dhiltgen please have a look

@DocMAX commented on GitHub (Feb 24, 2024):

And by the way, there is no /sys/module/amdgpu/version here. You have to correct the code.

@robertvazan commented on GitHub (Feb 24, 2024):

> ROCm unsupported integrated GPU detected

Ollama skipped the iGPU, because it has less than 1GB of VRAM. You have to configure VRAM allocation for the iGPU in BIOS to something like 8GB.
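
If you want to confirm what the driver is currently reporting before changing the BIOS setting, a minimal sketch along these lines may help (it assumes the standard amdgpu sysfs attribute `mem_info_vram_total`; this is not part of Ollama):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	// Each amdgpu device exposes its dedicated VRAM size in bytes via sysfs.
	paths, _ := filepath.Glob("/sys/class/drm/card*/device/mem_info_vram_total")
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		b, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
		if err != nil {
			continue
		}
		fmt.Printf("%s: %d MiB\n", p, b/(1<<20))
	}
}
```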

@DocMAX commented on GitHub (Feb 24, 2024):

> Ollama skipped the iGPU, because it has less than 1GB of VRAM. You have to configure VRAM allocation for the iGPU in BIOS to something like 8GB.

Thanks, I will check if I can do that.
But normal behaviour for the iGPU should be that it requests more VRAM if needed.

@robertvazan commented on GitHub (Feb 24, 2024):

> But normal behaviour for the iGPU should be that it requests more VRAM if needed.

Why do you think so? Where is it documented? Mine maxes at 512MB unless I explicitly configure it in BIOS.

@sid-cypher commented on GitHub (Feb 24, 2024):

> Ollama skipped the iGPU, because it has less than 1GB of VRAM. You have to configure VRAM allocation for the iGPU in BIOS to something like 8GB.

Detecting and using this VRAM information without sharing with the user the reason for the iGPU rejection leads to "missing support" issues being opened, rather than "increase my VRAM allocation" steps taken. I think the log output should be improved in this case. This task would probably qualify for a "good first issue" tag, too.
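
For illustration, a hedged sketch of what a more explicit skip message could look like; the function, the variable names and the 1 GiB threshold are assumptions for this example, not Ollama's actual detection code:

```go
package main

import (
	"fmt"
	"log/slog"
)

// reportSkippedIGPU is illustrative only: it shows the kind of message that
// would point users at the BIOS VRAM (UMA frame buffer) setting instead of
// leaving them with a bare "no GPU detected".
func reportSkippedIGPU(vramBytes uint64) {
	const minVRAM uint64 = 1 << 30 // assumed 1 GiB minimum
	if vramBytes < minVRAM {
		slog.Info(fmt.Sprintf(
			"skipping integrated GPU: only %d MiB VRAM reported (< %d MiB); increase the iGPU VRAM allocation in BIOS to use it",
			vramBytes/(1<<20), minVRAM/(1<<20)))
	}
}

func main() {
	reportSkippedIGPU(512 << 20) // e.g. the common 512 MiB default allocation
}
```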

@DocMAX commented on GitHub (Feb 24, 2024):

Totally agree!

@chiragkrishna commented on GitHub (Feb 24, 2024):

I have 2 systems.
The Ryzen 5500U system always gets stuck here. I've allotted 4GB VRAM for it in the BIOS; that's the max.

```
export HSA_OVERRIDE_GFX_VERSION=9.0.0
export HCC_AMDGPU_TARGETS=gfx900
```

```
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   703.44 MiB
llm_load_tensors:        CPU buffer size =    35.44 MiB
```

building with

```
export CGO_CFLAGS="-g"
export AMDGPU_TARGETS="gfx1030;gfx900"
go generate ./...
go build .
```

My 6750 XT system works perfectly.

@DocMAX commented on GitHub (Feb 24, 2024):

> > But normal behaviour for the iGPU should be that it requests more VRAM if needed.
>
> Why do you think so? Where is it documented? Mine maxes at 512MB unless I explicitly configure it in BIOS.

OK, I was wrong. It works now with 8GB VRAM, thank you!

```
discovered 1 ROCm GPU Devices
[0] ROCm device name: Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
[0] ROCm brand: Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
[0] ROCm vendor: Advanced Micro Devices, Inc. [AMD/ATI]
[0] ROCm VRAM vendor: unknown
[0] ROCm S/N: 
[0] ROCm subsystem name: 0x123
[0] ROCm vbios version: 113-CEZANNE-018
[0] ROCm totalMem 8589934592
[0] ROCm usedMem 25907200
time=2024-02-24T18:27:14.013Z level=DEBUG source=gpu.go:254 msg="rocm detected 1 devices with 7143M available memory"
```
@DocMAX commented on GitHub (Feb 24, 2024):

Hmm, I see the model loaded into VRAM, but nothing happens...

```
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
```
@DocMAX commented on GitHub (Feb 24, 2024):

Do I need a different amdgpu module on the host than the one from the kernel (6.7.6)?

@sid-cypher commented on GitHub (Feb 24, 2024):

> Do I need a different amdgpu module on the host than the one from the kernel (6.7.6)?

Maybe, https://github.com/ROCm/ROCm/issues/816 seems relevant. I'm just using AMD-provided DKMS modules from https://repo.radeon.com/amdgpu/6.0.2/ubuntu to be sure.

@DocMAX commented on GitHub (Feb 24, 2024):

Hmm, the tinyllama model does work with the 5800U. The bigger ones get stuck as I mentioned before.
Edit: Codellama works too.

@chiragkrishna commented on GitHub (Feb 25, 2024):

I added "-DLLAMA_HIP_UMA=ON" to "ollama/llm/generate/gen_linux.sh":

```
CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang -DCMAKE_CXX_COMPILER=$ROCM_PATH/llvm/bin/clang++ -DAMDGPU_TARGETS=$(amdGPUs) -DGPU_TARGETS=$(amdGPUs)"
```

Now it's stuck here:

```
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   809.59 MiB
llm_load_tensors:        CPU buffer size =    51.27 MiB
...............................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =    44.00 MiB
llama_new_context_with_model: KV self size  =   44.00 MiB, K (f16):   22.00 MiB, V (f16):   22.00 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =     9.02 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   148.01 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     4.00 MiB
llama_new_context_with_model: graph splits (measure): 3
[1708857011] warming up the model with an empty run
```
@robertvazan commented on GitHub (Feb 25, 2024):

iGPUs indeed do allocate system RAM on demand. It's called [GTT/GART](https://en.wikipedia.org/wiki/Graphics_address_remapping_table). Here's what I get when I run `sudo dmesg | grep "M of"` on my system with 32GB RAM:

If I set VRAM to Auto in BIOS:

```
[    4.654736] [drm] amdgpu: 512M of VRAM memory ready
[    4.654737] [drm] amdgpu: 15688M of GTT memory ready.
```

If I set VRAM to 8GB in BIOS:

```
[    4.670921] [drm] amdgpu: 8192M of VRAM memory ready
[    4.670923] [drm] amdgpu: 11908M of GTT memory ready.
```

If I set VRAM to 16GB in BIOS:

```
[    4.600060] [drm] amdgpu: 16384M of VRAM memory ready
[    4.600062] [drm] amdgpu: 7888M of GTT memory ready.
```

It looks like GTT size is 0.5*(RAM-VRAM). I wonder how far this can go if you have 64GB or 96GB RAM. Can you have an iGPU with 32GB or 48GB of GTT memory? That would make a $200 APU with $200 of DDR5 RAM superior to a $2,000 dGPU for running Mixtral and future sparse models. I also wonder whether any BIOS offers a 32GB VRAM setting if you have 64GB of RAM.
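
A quick arithmetic check of that observation against the dmesg numbers above (a sketch only; the driver's real GTT sizing policy may differ):

```go
package main

import "fmt"

func main() {
	const ramMiB = 32 * 1024 // the 32GB system from the dmesg output above
	for _, vramMiB := range []int{512, 8192, 16384} {
		gtt := (ramMiB - vramMiB) / 2 // 0.5 * (RAM - VRAM)
		fmt.Printf("VRAM %5d MiB -> expected GTT ~%5d MiB\n", vramMiB, gtt)
	}
	// Prints ~16128, ~12288 and ~8192 MiB, close to the reported 15688,
	// 11908 and 7888 MiB; the gap is roughly the RAM the kernel reserves.
}
```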

Unfortunately, [ROCm does not use GTT](https://github.com/ROCm/ROCm/issues/2014). That thread mentions several workarounds ([torch-apu-helper](https://github.com/pomoke/torch-apu-helper), [force-host-alloction-APU](https://github.com/segurac/force-host-alloction-APU), [Rusticl](https://docs.mesa3d.org/rusticl.html), [unlock VRAM allocation](https://winstonhyypia.medium.com/amd-apu-how-to-modify-the-dedicated-gpu-memory-e27b75905056)), but I am not sure whether Ollama would be able to use any of them. Chances are highest in a docker container, where Ollama has the greatest control over dependencies.

@DocMAX commented on GitHub (Feb 25, 2024):

Very cool findings. Interesting that you mention 96GB. I did some research and it seems that's the max we can buy right now for SO-DIMMs. I wasn't aware it's called GTT. Let's hope someday we get support for this.
If the host can't handle GTT for ROCm, then I doubt Docker can do anything about it.

https://github.com/segurac/force-host-alloction-APU looks like the best solution to me if it works. Will try in my docker containers...

```
[So Feb 25 21:31:38 2024] [drm] amdgpu: 512M of VRAM memory ready
[So Feb 25 21:31:38 2024] [drm] amdgpu: 31844M of GTT memory ready.
```

This is how much I would get :-) (64GB system)

@DocMAX commented on GitHub (Feb 25, 2024):

OK, it doesn't work with Ollama. I wasn't aware that it doesn't use PyTorch, right?

@chiragkrishna commented on GitHub (Feb 26, 2024):

llama.cpp supports it. That's what I was trying to do in my previous post. [Support AMD Ryzen Unified Memory Architecture (UMA)](https://github.com/pytorch/pytorch/issues/107605)

@robertvazan commented on GitHub (Feb 26, 2024):

@chiragkrishna Do you mean this? https://github.com/ggerganov/llama.cpp/pull/4449

Since llama.cpp already supports UMA (GTT/GART), Ollama could perhaps include a llama.cpp build with UMA enabled and use it when the conditions are right (AMD iGPU with VRAM smaller than the model).

PS: UMA support seems a bit unstable, so perhaps enable it with an environment variable at first.
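
A very rough sketch of that gating idea; the `OLLAMA_HIP_UMA` variable, the function and the variant names are made up for illustration and are not existing Ollama options:

```go
package main

import (
	"fmt"
	"os"
)

// pickROCmVariant illustrates the suggestion above: only fall back to a
// llama.cpp build compiled with -DLLAMA_HIP_UMA=ON when the user opts in and
// the model would not fit into the iGPU's dedicated VRAM.
func pickROCmVariant(isAMDiGPU bool, vramBytes, modelBytes uint64) string {
	optIn := os.Getenv("OLLAMA_HIP_UMA") == "1" // hypothetical opt-in switch
	if isAMDiGPU && optIn && modelBytes > vramBytes {
		return "rocm_uma"
	}
	return "rocm"
}

func main() {
	fmt.Println(pickROCmVariant(true, 512<<20, 4<<30))
}
```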

@DocMAX commented on GitHub (Feb 26, 2024):

How does the env thing work? Like this? (Doesn't do anything btw)
`LLAMA_HIP_UMA=1 HSA_OVERRIDE_GFX_VERSION=9.0.0 HCC_AMDGPU_TARGETS==gfx900 ollama start`

@robertvazan commented on GitHub (Feb 26, 2024):

@DocMAX I don't think there's UMA support in ollama yet. It's a compile-time option in llama.cpp. The other env variables (HSA_OVERRIDE_GFX_VERSION was sufficient in my experiments) are correctly passed down to ROCm.

@chiragkrishna commented on GitHub (Feb 26, 2024):

git clone the repo and add them as quoted here:

> i added this "-DLLAMA_HIP_UMA=ON" to "ollama/llm/generate/gen_linux.sh"
>
> ```
> CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang -DCMAKE_CXX_COMPILER=$ROCM_PATH/llvm/bin/clang++ -DAMDGPU_TARGETS=$(amdGPUs) -DGPU_TARGETS=$(amdGPUs)"
> ```
>
> now its stuck here
>
> ```
> llm_load_tensors: offloading 22 repeating layers to GPU
> llm_load_tensors: offloading non-repeating layers to GPU
> llm_load_tensors: offloaded 23/23 layers to GPU
> llm_load_tensors:      ROCm0 buffer size =   809.59 MiB
> llm_load_tensors:        CPU buffer size =    51.27 MiB
> ...............................................................................
> llama_new_context_with_model: n_ctx      = 2048
> llama_new_context_with_model: freq_base  = 10000.0
> llama_new_context_with_model: freq_scale = 1
> llama_kv_cache_init:      ROCm0 KV buffer size =    44.00 MiB
> llama_new_context_with_model: KV self size  =   44.00 MiB, K (f16):   22.00 MiB, V (f16):   22.00 MiB
> llama_new_context_with_model:  ROCm_Host input buffer size   =     9.02 MiB
> llama_new_context_with_model:      ROCm0 compute buffer size =   148.01 MiB
> llama_new_context_with_model:  ROCm_Host compute buffer size =     4.00 MiB
> llama_new_context_with_model: graph splits (measure): 3
> [1708857011] warming up the model with an empty run
> ```
@dhiltgen commented on GitHub (Feb 26, 2024):

I haven't dug deeply into this yet, but from what I've seen, I believe we'll need a second variant for ROCm, compiled with system/unified memory support, to support modern iGPUs. Setting these flags in llama.cpp will degrade performance on discrete GPUs, but since we already have a model for supporting multiple variants, it shouldn't be a problem to have both.

I'm working on some refinements to amdgpu discovery to try to pivot over to pure sysfs discovery, which should help here.
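
For context, a minimal sketch of what sysfs-only discovery can look like, reading standard amdgpu attributes under /sys/class/drm; this is just the general idea, not the actual code being described above:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Enumerate DRM cards, keep AMD devices (PCI vendor 0x1002) and read VRAM
	// and GTT sizes straight from sysfs, without loading any ROCm library.
	devs, _ := filepath.Glob("/sys/class/drm/card[0-9]*/device")
	for _, dev := range devs {
		vendor, err := os.ReadFile(filepath.Join(dev, "vendor"))
		if err != nil || strings.TrimSpace(string(vendor)) != "0x1002" {
			continue
		}
		vram, _ := os.ReadFile(filepath.Join(dev, "mem_info_vram_total"))
		gtt, _ := os.ReadFile(filepath.Join(dev, "mem_info_gtt_total"))
		fmt.Printf("%s: VRAM=%s bytes, GTT=%s bytes\n", dev,
			strings.TrimSpace(string(vram)), strings.TrimSpace(string(gtt)))
	}
}
```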

@DocMAX commented on GitHub (Feb 26, 2024):

> CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang -DCMAKE_CXX_COMPILER=$ROCM_PATH/llvm/bin/clang++ -DAMDGPU_TARGETS=$(amdGPUs) -DGPU_TARGETS=$(amdGPUs)"

Did so, but I still get "no GPU detected"...

@chiragkrishna commented on GitHub (Feb 27, 2024):

build:

```
git clone https://github.com/ollama/ollama.git
add "-DLLAMA_HIP_UMA=ON" to "ollama/llm/generate/gen_linux.sh" to CMAKE_DEFS=
export CGO_CFLAGS="-g"
export AMDGPU_TARGETS="gfx900"
go generate ./...
go build .
```

run:

```
export HSA_OVERRIDE_GFX_VERSION="9.0.0"
./ollama serve
```
@DocMAX commented on GitHub (Feb 27, 2024):

```
time=2024-02-27T22:36:04.112Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-27T22:36:04.112Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-27T22:36:04.156Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-27T22:36:04.156Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-27T22:36:04.165Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/opt/rocm/lib/librocm_smi64.so.6.0.60002 /opt/rocm-6.0.2/lib/librocm_smi64.so.6.0.60002]"
time=2024-02-27T22:36:04.240Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-27T22:36:04.240Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-27T22:36:04.240Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-27T22:36:04.240Z level=INFO source=routes.go:1042 msg="no GPU detected"
```

I did exactly that, but it's not working... very strange. CPU: AMD 5800U

@chiragkrishna commented on GitHub (Feb 28, 2024):

Try playing with:

```
ROCR_VISIBLE_DEVICES=0 ollama serve
or
ROCR_VISIBLE_DEVICES=1 ollama serve
```

Ollama currently has a few broken checks for AMD integrated GPUs:

```
gpu.go:181 msg="ROCm unsupported integrated GPU detected"
```
@DocMAX commented on GitHub (Feb 28, 2024):

Nope, doesn't make a difference :-(

@chiragkrishna commented on GitHub (Feb 28, 2024):

change "tooOld" to this and compile and see.

ollama/gpu/gpu.go from line 173

```go
gfx := AMDGFXVersions()
		tooOld := false
		for _, v := range gfx {
			if v.Major < 9 {
				slog.Info("AMD GPU too old, falling back to CPU " + v.ToGFXString())
				tooOld = true
				break
			}

			// TODO - remap gfx strings for unsupporetd minor/patch versions to supported for the same major
			// e.g. gfx1034 works if we map it to gfx1030 at runtime

		}
		if !tooOld {
			// TODO - this algo can be shifted over to use sysfs instead of the rocm info library...
			C.rocm_check_vram(*gpuHandles.rocm, &memInfo)
				resp.Library = "rocm"
				var version C.rocm_version_resp_t
				C.rocm_get_version(*gpuHandles.rocm, &version)
				verString := C.GoString(version.str)
				if version.status == 0 {
					resp.Variant = "v" + verString
				} else {
					slog.Info(fmt.Sprintf("failed to look up ROCm version: %s", verString))
				}
				C.free(unsafe.Pointer(version.str))
		}
	}
	if resp.Library == "" {
		C.cpu_check_ram(&memInfo)
		resp.Library = "cpu"
		resp.Variant = cpuVariant
```

Even if your GPU is detected, you will be stuck at the same place as me, I guess.

@DocMAX commented on GitHub (Feb 28, 2024):

Still no GPU, I give up.

@DocMAX commented on GitHub (Mar 8, 2024):

Something happened now...

```
Mar 08 01:22:47 ai ollama[10945]: time=2024-03-08T01:22:47.056Z level=INFO source=amd_linux.go:88 msg="detected amdgpu versions [gfx9012]"
Mar 08 01:22:47 ai ollama[10945]: time=2024-03-08T01:22:47.056Z level=INFO source=amd_linux.go:238 msg="[1] amdgpu totalMemory 536870912"
Mar 08 01:22:47 ai ollama[10945]: time=2024-03-08T01:22:47.056Z level=INFO source=amd_linux.go:239 msg="[1] amdgpu freeMemory  536870912"
Mar 08 01:22:47 ai ollama[10945]: time=2024-03-08T01:22:47.056Z level=INFO source=llm.go:111 msg="not enough vram available, falling back to CPU only"
```

I compiled with "-DLLAMA_HIP_UMA=ON"... So UMA still not working...

@chiragkrishna commented on GitHub (Mar 8, 2024):

Compiled just now. No luck with the Ryzen 5500U.

```
ollama serve
time=2024-03-08T07:36:58.079+05:30 level=INFO source=images.go:796 msg="total blobs: 11"
time=2024-03-08T07:36:58.079+05:30 level=INFO source=images.go:803 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/jmorganca/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/jmorganca/ollama/server.EmbeddingsHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/jmorganca/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET    /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
time=2024-03-08T07:36:58.079+05:30 level=INFO source=routes.go:1019 msg="Listening on [::]:11434 (version 0.0.0)"
time=2024-03-08T07:36:58.080+05:30 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-08T07:36:58.922+05:30 level=INFO source=payload_common.go:150 msg="Dynamic LLM libraries [rocm_v6 cpu_avx2 cpu cpu_avx rocm_v60002]"
time=2024-03-08T07:36:58.922+05:30 level=DEBUG source=payload_common.go:151 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-03-08T07:36:58.922+05:30 level=INFO source=gpu.go:77 msg="Detecting GPU type"
time=2024-03-08T07:36:58.922+05:30 level=INFO source=gpu.go:191 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-08T07:36:58.922+05:30 level=DEBUG source=gpu.go:209 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /home/bunneo/libnvidia-ml.so*]"
time=2024-03-08T07:36:58.924+05:30 level=INFO source=gpu.go:237 msg="Discovered GPU libraries: []"
time=2024-03-08T07:36:58.924+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-08T07:36:58.924+05:30 level=WARN source=amd_linux.go:53 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-03-08T07:36:58.924+05:30 level=INFO source=amd_linux.go:88 msg="detected amdgpu versions [gfx9012]"
time=2024-03-08T07:36:58.924+05:30 level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /home/bunneo/.ollama/assets/0.0.0/rocm"
time=2024-03-08T07:36:58.924+05:30 level=DEBUG source=amd_linux.go:123 msg="skipping rocm gfx compatibility check with HSA_OVERRIDE_GFX_VERSION=9.0.0"
time=2024-03-08T07:36:58.925+05:30 level=DEBUG source=amd_linux.go:171 msg="discovering amdgpu devices [1]"
time=2024-03-08T07:36:58.925+05:30 level=INFO source=amd_linux.go:238 msg="[1] amdgpu totalMemory 4294967296"
time=2024-03-08T07:36:58.925+05:30 level=INFO source=amd_linux.go:239 msg="[1] amdgpu freeMemory  4294967296"
time=2024-03-08T07:36:58.925+05:30 level=DEBUG source=gpu.go:180 msg="rocm detected 1 devices with 3072M available memory"
[GIN] 2024/03/08 - 07:37:12 | 200 |      60.553µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/08 - 07:37:12 | 200 |     349.625µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/03/08 - 07:37:12 | 200 |     178.304µs |       127.0.0.1 | POST     "/api/show"
time=2024-03-08T07:37:12.650+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-08T07:37:12.651+05:30 level=WARN source=amd_linux.go:53 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-03-08T07:37:12.651+05:30 level=INFO source=amd_linux.go:88 msg="detected amdgpu versions [gfx9012]"
time=2024-03-08T07:37:12.651+05:30 level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /home/bunneo/.ollama/assets/0.0.0/rocm"
time=2024-03-08T07:37:12.651+05:30 level=DEBUG source=amd_linux.go:123 msg="skipping rocm gfx compatibility check with HSA_OVERRIDE_GFX_VERSION=9.0.0"
time=2024-03-08T07:37:12.651+05:30 level=DEBUG source=amd_linux.go:171 msg="discovering amdgpu devices [1]"
time=2024-03-08T07:37:12.651+05:30 level=INFO source=amd_linux.go:238 msg="[1] amdgpu totalMemory 4294967296"
time=2024-03-08T07:37:12.651+05:30 level=INFO source=amd_linux.go:239 msg="[1] amdgpu freeMemory  4294967296"
time=2024-03-08T07:37:12.651+05:30 level=DEBUG source=gpu.go:180 msg="rocm detected 1 devices with 3072M available memory"
time=2024-03-08T07:37:12.651+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-08T07:37:12.651+05:30 level=WARN source=amd_linux.go:53 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-03-08T07:37:12.651+05:30 level=INFO source=amd_linux.go:88 msg="detected amdgpu versions [gfx9012]"
time=2024-03-08T07:37:12.651+05:30 level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /home/bunneo/.ollama/assets/0.0.0/rocm"
time=2024-03-08T07:37:12.651+05:30 level=DEBUG source=amd_linux.go:123 msg="skipping rocm gfx compatibility check with HSA_OVERRIDE_GFX_VERSION=9.0.0"
time=2024-03-08T07:37:12.651+05:30 level=DEBUG source=amd_linux.go:171 msg="discovering amdgpu devices [1]"
time=2024-03-08T07:37:12.652+05:30 level=INFO source=amd_linux.go:238 msg="[1] amdgpu totalMemory 4294967296"
time=2024-03-08T07:37:12.652+05:30 level=INFO source=amd_linux.go:239 msg="[1] amdgpu freeMemory  4294967296"
time=2024-03-08T07:37:12.652+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-08T07:37:12.652+05:30 level=DEBUG source=payload_common.go:93 msg="ordered list of LLM libraries to try [/home/bunneo/.ollama/assets/0.0.0/rocm_v6/libext_server.so /home/bunneo/.ollama/assets/0.0.0/rocm_v60002/libext_server.so /home/bunneo/.ollama/assets/0.0.0/cpu_avx2/libext_server.so]"
loading library /home/bunneo/.ollama/assets/0.0.0/rocm_v6/libext_server.so
time=2024-03-08T07:37:12.687+05:30 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /home/bunneo/.ollama/assets/0.0.0/rocm_v6/libext_server.so"
time=2024-03-08T07:37:12.688+05:30 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
time=2024-03-08T07:37:12.688+05:30 level=DEBUG source=dyn_ext_server.go:151 msg="server params: {model:0x7e7f1c109df0 n_ctx:2048 n_batch:512 n_threads:0 n_parallel:1 rope_freq_base:0 rope_freq_scale:0 memory_f16:true n_gpu_layers:23 main_gpu:0 use_mlock:false use_mmap:true numa:0 embedding:true lora_adapters:<nil> mmproj:<nil> verbose_logging:true _:[0 0 0 0 0 0 0]}"
[1709863632] system info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
[1709863632] Performing pre-initialization of GPU
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 9.0, VMM: no
CUDA error: out of memory
  current device: 0, in function ggml_init_cublas at /home/bunneo/ollama/llm/llama.cpp/ggml-cuda.cu:8771
  hipStreamCreateWithFlags(&g_cudaStreams[id][is], 0x01)
GGML_ASSERT: /home/bunneo/ollama/llm/llama.cpp/ggml-cuda.cu:256: !"CUDA error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
SIGABRT: abort
PC=0x7e7f8fc969fc m=9 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 8 gp=0xc000580a80 m=9 mp=0xc000584808 [syscall]:
runtime.cgocall(0xeba490, 0xc00004c760)
	/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc00004c738 sp=0xc00004c700 pc=0x40a74b
github.com/jmorganca/ollama/llm._Cfunc_dyn_llama_server_init({0x7e7f1c001230, 0x7e7f02e82840, 0x7e7f02e83120, 0x7e7f02e831b0, 0x7e7f02e83410, 0x7e7f02e83600, 0x7e7f02e83ee0, 0x7e7f02e83eb0, 0x7e7f02e83fa0, 0x7e7f02e844f0, ...}, ...)
	_cgo_gotypes.go:290 +0x45 fp=0xc00004c760 sp=0xc00004c738 pc=0xce47a5
github.com/jmorganca/ollama/llm.newDynExtServer.func7(0xc0000a8230, 0xc00089c030)
	/home/bunneo/ollama/llm/dyn_ext_server.go:154 +0x112 fp=0xc00004c8a0 sp=0xc00004c760 pc=0xce5e52
github.com/jmorganca/ollama/llm.newDynExtServer({0xc0003c1600, 0x3a}, {0xc00055e230, _}, {_, _, _}, {0x0, 0x0, 0x0}, ...)
	/home/bunneo/ollama/llm/dyn_ext_server.go:154 +0xb50 fp=0xc00004cae8 sp=0xc00004c8a0 pc=0xce5a90
github.com/jmorganca/ollama/llm.newLlmServer({{_, _, _}, {_, _}, {_, _}}, {_, _}, {0x0, ...}, ...)
	/home/bunneo/ollama/llm/llm.go:158 +0x4c5 fp=0xc00004cca8 sp=0xc00004cae8 pc=0xce2085
github.com/jmorganca/ollama/llm.New({0xc00055e230, 0x69}, {0x0, 0x0, 0x0}, {0x0, _, _}, {{0x0, 0x800, ...}, ...})
	/home/bunneo/ollama/llm/llm.go:123 +0x76e fp=0xc00004cf18 sp=0xc00004cca8 pc=0xce194e
github.com/jmorganca/ollama/server.load(0xc000554000?, 0xc000554000, {{0x0, 0x800, 0x200, 0x1, 0xffffffffffffffff, 0x0, 0x0, 0x1, ...}, ...}, ...)
	/home/bunneo/ollama/server/routes.go:83 +0x325 fp=0xc00004d068 sp=0xc00004cf18 pc=0xe92ae5
github.com/jmorganca/ollama/server.ChatHandler(0xc0000c9600)
	/home/bunneo/ollama/server/routes.go:1173 +0xa37 fp=0xc00004d770 sp=0xc00004d068 pc=0xe9e2b7
github.com/gin-gonic/gin.(*Context).Next(...)
	/home/bunneo/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174
github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func1(0xc0000c9600)
	/home/bunneo/ollama/server/routes.go:943 +0x68 fp=0xc00004d7a8 sp=0xc00004d770 pc=0xe9ca48
github.com/gin-gonic/gin.(*Context).Next(...)
	/home/bunneo/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174
github.com/gin-gonic/gin.CustomRecoveryWithWriter.func1(0xc0000c9600)
	/home/bunneo/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/recovery.go:102 +0x7a fp=0xc00004d7f8 sp=0xc00004d7a8 pc=0xe72a1a
github.com/gin-gonic/gin.(*Context).Next(...)
	/home/bunneo/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174
github.com/gin-gonic/gin.LoggerWithConfig.func1(0xc0000c9600)
	/home/bunneo/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/logger.go:240 +0xdd fp=0xc00004d9a8 sp=0xc00004d7f8 pc=0xe71b5d
github.com/gin-gonic/gin.(*Context).Next(...)
	/home/bunneo/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174
github.com/gin-gonic/gin.(*Engine).handleHTTPRequest(0xc0000d0340, 0xc0000c9600)
	/home/bunneo/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620 +0x66e fp=0xc00004db28 sp=0xc00004d9a8 pc=0xe7104e
github.com/gin-gonic/gin.(*Engine).ServeHTTP(0xc0000d0340, {0x43da148, 0xc000178000}, 0xc000172000)
	/home/bunneo/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576 +0x1b2 fp=0xc00004db60 sp=0xc00004db28 pc=0xe70812
net/http.serverHandler.ServeHTTP({0x43d8028?}, {0x43da148?, 0xc000178000?}, 0x6?)
	/usr/local/go/src/net/http/server.go:3137 +0x8e fp=0xc00004db90 sp=0xc00004db60 pc=0x6fe1ee
net/http.(*conn).serve(0xc0000cc090, {0x43dc508, 0xc0001af440})
	/usr/local/go/src/net/http/server.go:2039 +0x5e8 fp=0xc00004dfb8 sp=0xc00004db90 pc=0x6f95a8
net/http.(*Server).Serve.gowrap3()
	/usr/local/go/src/net/http/server.go:3285 +0x28 fp=0xc00004dfe0 sp=0xc00004dfb8 pc=0x6fea08
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00004dfe8 sp=0xc00004dfe0 pc=0x474301
created by net/http.(*Server).Serve in goroutine 1
	/usr/local/go/src/net/http/server.go:3285 +0x4b4

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0xc000054008?, 0x0?, 0xc0?, 0x61?, 0xc0005af870?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc0005af838 sp=0xc0005af818 pc=0x44162e
runtime.netpollblock(0xc0005af8d0?, 0x409ee6?, 0x0?)
	/usr/local/go/src/runtime/netpoll.go:573 +0xf7 fp=0xc0005af870 sp=0xc0005af838 pc=0x43a397
internal/poll.runtime_pollWait(0x7e7f8feabe40, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85 fp=0xc0005af890 sp=0xc0005af870 pc=0x46ea05
internal/poll.(*pollDesc).wait(0x3?, 0x3fe?, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0005af8b8 sp=0xc0005af890 pc=0x5030c7
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000525100)
	/usr/local/go/src/internal/poll/fd_unix.go:611 +0x2ac fp=0xc0005af960 sp=0xc0005af8b8 pc=0x50846c
net.(*netFD).accept(0xc000525100)
	/usr/local/go/src/net/fd_unix.go:172 +0x29 fp=0xc0005afa18 sp=0xc0005af960 pc=0x597c69
net.(*TCPListener).accept(0xc00052d660)
	/usr/local/go/src/net/tcpsock_posix.go:159 +0x1e fp=0xc0005afa40 sp=0xc0005afa18 pc=0x5acf3e
net.(*TCPListener).Accept(0xc00052d660)
	/usr/local/go/src/net/tcpsock.go:327 +0x30 fp=0xc0005afa70 sp=0xc0005afa40 pc=0x5ac130
net/http.(*onceCloseListener).Accept(0xc0000cc090?)
	<autogenerated>:1 +0x24 fp=0xc0005afa88 sp=0xc0005afa70 pc=0x720bc4
net/http.(*Server).Serve(0xc0005200f0, {0x43d9ed8, 0xc00052d660})
	/usr/local/go/src/net/http/server.go:3255 +0x33e fp=0xc0005afbb8 sp=0xc0005afa88 pc=0x6fe61e
github.com/jmorganca/ollama/server.Serve({0x43d9ed8, 0xc00052d660})
	/home/bunneo/ollama/server/routes.go:1046 +0x4ab fp=0xc0005afcc0 sp=0xc0005afbb8 pc=0xe9cf4b
github.com/jmorganca/ollama/cmd.RunServer(0xc0000c8b00?, {0x4b2f300?, 0x4?, 0x1050133?})
	/home/bunneo/ollama/cmd/cmd.go:787 +0x1b9 fp=0xc0005afd58 sp=0xc0005afcc0 pc=0xeb0c99
github.com/spf13/cobra.(*Command).execute(0xc00054b508, {0x4b2f300, 0x0, 0x0})
	/home/bunneo/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:940 +0x882 fp=0xc0005afe78 sp=0xc0005afd58 pc=0x793b42
github.com/spf13/cobra.(*Command).ExecuteC(0xc00054a908)
	/home/bunneo/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc0005aff30 sp=0xc0005afe78 pc=0x794385
github.com/spf13/cobra.(*Command).Execute(...)
	/home/bunneo/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	/home/bunneo/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
	/home/bunneo/ollama/main.go:11 +0x4d fp=0xc0005aff50 sp=0xc0005aff30 pc=0xeb8e4d
runtime.main()
	/usr/local/go/src/runtime/proc.go:271 +0x29d fp=0xc0005affe0 sp=0xc0005aff50 pc=0x4411fd
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0005affe8 sp=0xc0005affe0 pc=0x474301

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000074fa8 sp=0xc000074f88 pc=0x44162e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.forcegchelper()
	/usr/local/go/src/runtime/proc.go:326 +0xb3 fp=0xc000074fe0 sp=0xc000074fa8 pc=0x4414b3
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x474301
created by runtime.init.6 in goroutine 1
	/usr/local/go/src/runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000075780 sp=0xc000075760 pc=0x44162e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.bgsweep(0xc00007c000)
	/usr/local/go/src/runtime/mgcsweep.go:318 +0xdf fp=0xc0000757c8 sp=0xc000075780 pc=0x42cbdf
runtime.gcenable.gowrap1()
	/usr/local/go/src/runtime/mgc.go:203 +0x25 fp=0xc0000757e0 sp=0xc0000757c8 pc=0x4214c5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000757e8 sp=0xc0000757e0 pc=0x474301
created by runtime.gcenable in goroutine 1
	/usr/local/go/src/runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xa479cc?, 0x9d1b96?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000075f78 sp=0xc000075f58 pc=0x44162e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.(*scavengerState).park(0x4ac92a0)
	/usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000075fa8 sp=0xc000075f78 pc=0x42a569
runtime.bgscavenge(0xc00007c000)
	/usr/local/go/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000075fc8 sp=0xc000075fa8 pc=0x42ab19
runtime.gcenable.gowrap2()
	/usr/local/go/src/runtime/mgc.go:204 +0x25 fp=0xc000075fe0 sp=0xc000075fc8 pc=0x421465
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000075fe8 sp=0xc000075fe0 pc=0x474301
created by runtime.gcenable in goroutine 1
	/usr/local/go/src/runtime/mgc.go:204 +0xa5

goroutine 18 gp=0xc000104380 m=nil [finalizer wait]:
runtime.gopark(0xc000074648?, 0x414885?, 0xa8?, 0x1?, 0xc0000061c0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000074620 sp=0xc000074600 pc=0x44162e
runtime.runfinq()
	/usr/local/go/src/runtime/mfinal.go:194 +0x107 fp=0xc0000747e0 sp=0xc000074620 pc=0x420507
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000747e8 sp=0xc0000747e0 pc=0x474301
created by runtime.createfing in goroutine 1
	/usr/local/go/src/runtime/mfinal.go:164 +0x3d

goroutine 19 gp=0xc000105c00 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000070750 sp=0xc000070730 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000707e0 sp=0xc000070750 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000707e8 sp=0xc0000707e0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 20 gp=0xc000105dc0 m=nil [GC worker (idle)]:
runtime.gopark(0x4b312c0?, 0x1?, 0x25?, 0x15?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000070f50 sp=0xc000070f30 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc000070fe0 sp=0xc000070f50 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000070fe8 sp=0xc000070fe0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 21 gp=0xc000498000 m=nil [GC worker (idle)]:
runtime.gopark(0x1f7736bb36e?, 0x1?, 0x49?, 0x69?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000071750 sp=0xc000071730 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000717e0 sp=0xc000071750 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000717e8 sp=0xc0000717e0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 5 gp=0xc000007c00 m=nil [GC worker (idle)]:
runtime.gopark(0x1f773719d05?, 0x3?, 0x1d?, 0x53?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000076750 sp=0xc000076730 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000767e0 sp=0xc000076750 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000767e8 sp=0xc0000767e0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 34 gp=0xc000500000 m=nil [GC worker (idle)]:
runtime.gopark(0x1f773719d05?, 0x1?, 0x9f?, 0x7f?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000506750 sp=0xc000506730 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0005067e0 sp=0xc000506750 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0005067e8 sp=0xc0005067e0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 6 gp=0xc000007dc0 m=nil [GC worker (idle)]:
runtime.gopark(0x4b312c0?, 0x1?, 0xa3?, 0x96?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000076f50 sp=0xc000076f30 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc000076fe0 sp=0xc000076f50 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000076fe8 sp=0xc000076fe0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 35 gp=0xc0005001c0 m=nil [GC worker (idle)]:
runtime.gopark(0x1f773719d05?, 0x3?, 0xb1?, 0x8?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000506f50 sp=0xc000506f30 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc000506fe0 sp=0xc000506f50 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000506fe8 sp=0xc000506fe0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 36 gp=0xc000500380 m=nil [GC worker (idle)]:
runtime.gopark(0x1f773719d05?, 0x1?, 0x69?, 0xed?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000507750 sp=0xc000507730 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0005077e0 sp=0xc000507750 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0005077e8 sp=0xc0005077e0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 37 gp=0xc000500540 m=nil [GC worker (idle)]:
runtime.gopark(0x1f773719d05?, 0x3?, 0xd5?, 0x79?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000507f50 sp=0xc000507f30 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc000507fe0 sp=0xc000507f50 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000507fe8 sp=0xc000507fe0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 38 gp=0xc000500700 m=nil [GC worker (idle)]:
runtime.gopark(0x1f7736bd12b?, 0x3?, 0x53?, 0x4?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000508750 sp=0xc000508730 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0005087e0 sp=0xc000508750 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0005087e8 sp=0xc0005087e0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 50 gp=0xc000580000 m=nil [GC worker (idle)]:
runtime.gopark(0x1f773719d05?, 0x1?, 0x8d?, 0x9?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000502750 sp=0xc000502730 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0005027e0 sp=0xc000502750 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0005027e8 sp=0xc0005027e0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 39 gp=0xc0005008c0 m=nil [GC worker (idle)]:
runtime.gopark(0x1f773719d05?, 0x3?, 0x62?, 0x9c?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000508f50 sp=0xc000508f30 pc=0x44162e
runtime.gcBgMarkWorker()
	/usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc000508fe0 sp=0xc000508f50 pc=0x4235a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000508fe8 sp=0xc000508fe0 pc=0x474301
created by runtime.gcBgMarkStartWorkers in goroutine 1
	/usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 7 gp=0xc000500c40 m=nil [select, locked to thread]:
runtime.gopark(0xc000505fa8?, 0x2?, 0xc9?, 0x18?, 0xc000505f94?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000505e38 sp=0xc000505e18 pc=0x44162e
runtime.selectgo(0xc000505fa8, 0xc000505f90, 0x0?, 0x0, 0x0?, 0x1)
	/usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc000505f58 sp=0xc000505e38 pc=0x452a85
runtime.ensureSigM.func1()
	/usr/local/go/src/runtime/signal_unix.go:1034 +0x19f fp=0xc000505fe0 sp=0xc000505f58 pc=0x46b73f
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000505fe8 sp=0xc000505fe0 pc=0x474301
created by runtime.ensureSigM in goroutine 1
	/usr/local/go/src/runtime/signal_unix.go:1017 +0xc8

goroutine 40 gp=0xc0004981c0 m=4 mp=0xc00007b808 [syscall]:
runtime.notetsleepg(0x4b2ff80, 0xffffffffffffffff)
	/usr/local/go/src/runtime/lock_futex.go:246 +0x29 fp=0xc0005097a0 sp=0xc000509778 pc=0x412ea9
os/signal.signal_recv()
	/usr/local/go/src/runtime/sigqueue.go:152 +0x29 fp=0xc0005097c0 sp=0xc0005097a0 pc=0x470d69
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:23 +0x13 fp=0xc0005097e0 sp=0xc0005097c0 pc=0x722f73
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0005097e8 sp=0xc0005097e0 pc=0x474301
created by os/signal.Notify.func1.1 in goroutine 1
	/usr/local/go/src/os/signal/signal.go:151 +0x1f

goroutine 51 gp=0xc0005808c0 m=nil [chan receive]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc00001a718 sp=0xc00001a6f8 pc=0x44162e
runtime.chanrecv(0xc0001e3860, 0x0, 0x1)
	/usr/local/go/src/runtime/chan.go:583 +0x3bf fp=0xc00001a790 sp=0xc00001a718 pc=0x40cd5f
runtime.chanrecv1(0x0?, 0x0?)
	/usr/local/go/src/runtime/chan.go:442 +0x12 fp=0xc00001a7b8 sp=0xc00001a790 pc=0x40c972
github.com/jmorganca/ollama/server.Serve.func2()
	/home/bunneo/ollama/server/routes.go:1028 +0x25 fp=0xc00001a7e0 sp=0xc00001a7b8 pc=0xe9cfe5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00001a7e8 sp=0xc00001a7e0 pc=0x474301
created by github.com/jmorganca/ollama/server.Serve in goroutine 1
	/home/bunneo/ollama/server/routes.go:1027 +0x3f6

goroutine 11 gp=0xc000580e00 m=nil [IO wait]:
runtime.gopark(0x10?, 0x10?, 0xf0?, 0x3d?, 0xb?)
	/usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000073da8 sp=0xc000073d88 pc=0x44162e
runtime.netpollblock(0x486418?, 0x409ee6?, 0x0?)
	/usr/local/go/src/runtime/netpoll.go:573 +0xf7 fp=0xc000073de0 sp=0xc000073da8 pc=0x43a397
internal/poll.runtime_pollWait(0x7e7f8feabd48, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85 fp=0xc000073e00 sp=0xc000073de0 pc=0x46ea05
internal/poll.(*pollDesc).wait(0xc000524680?, 0xc0001af631?, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000073e28 sp=0xc000073e00 pc=0x5030c7
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000524680, {0xc0001af631, 0x1, 0x1})
	/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc000073ec0 sp=0xc000073e28 pc=0x5043ba
net.(*netFD).Read(0xc000524680, {0xc0001af631?, 0xc000073f48?, 0x470a70?})
	/usr/local/go/src/net/fd_posix.go:55 +0x25 fp=0xc000073f08 sp=0xc000073ec0 pc=0x595c85
net.(*conn).Read(0xc00057c0e8, {0xc0001af631?, 0x0?, 0x4b2f300?})
	/usr/local/go/src/net/net.go:179 +0x45 fp=0xc000073f50 sp=0xc000073f08 pc=0x5a3e85
net.(*TCPConn).Read(0x4a389e0?, {0xc0001af631?, 0x0?, 0x0?})
	<autogenerated>:1 +0x25 fp=0xc000073f80 sp=0xc000073f50 pc=0x5b5505
net/http.(*connReader).backgroundRead(0xc0001af620)
	/usr/local/go/src/net/http/server.go:681 +0x37 fp=0xc000073fc8 sp=0xc000073f80 pc=0x6f3517
net/http.(*connReader).startBackgroundRead.gowrap2()
	/usr/local/go/src/net/http/server.go:677 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x6f3445
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x474301
created by net/http.(*connReader).startBackgroundRead in goroutine 8
	/usr/local/go/src/net/http/server.go:677 +0xba

rax    0x0
rbx    0x7e7f417fa640
rcx    0x7e7f8fc969fc
rdx    0x6
rdi    0x58b1
rsi    0x58b9
rbp    0x58b9
rsp    0x7e7f417f8db0
r8     0x7e7f417f8e80
r9     0x0
r10    0x8
r11    0x246
r12    0x6
r13    0x16
r14    0x2243
r15    0x7e7f0307fa44
rip    0x7e7f8fc969fc
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
```

<!-- gh-comment-id:1984927019 -->
Author
Owner

@robertvazan commented on GitHub (Mar 9, 2024):

The last update of the Docker image introduced an upgrade to ROCm 6.0, which dropped support for gfx900, so now the Ryzen 5600G does not work even with HSA_OVERRIDE_GFX_VERSION. AMD screwed us. The last working version is 0.1.27 with ROCm 5.7.

@dhiltgen promised support for multiple ROCm versions. I am looking forward to it.

Author
Owner

@robertvazan commented on GitHub (Mar 9, 2024):

Also looking forward to Vulkan support (#2033, #2578), which looks like a better solution than ROCm.

Author
Owner

@kirel commented on GitHub (Mar 9, 2024):

My AMD Ryzen 7 7840HS w/ Radeon 780M Graphics works great with HSA_OVERRIDE_GFX_VERSION=11.0.0 - I set the VRAM to UM_SPECIFIED and 16G (I have 32G of RAM) in the BIOS of my Minisforum UM780 XTX mini PC.

Author
Owner

@DocMAX commented on GitHub (Mar 9, 2024):

Yeah, sure that works. But this is about getting Ollama to run with UMA memory and "auto" mode in the BIOS!

Author
Owner

@robertvazan commented on GitHub (Mar 9, 2024):

@kirel Your iGPU is RDNA3, which is still supported by ROCm. ROCm definitely works, it's just that they deprecate hardware really quickly (my CPU is 6 months old). Vulkan will hopefully provide wider and more long-lived support without any hacks.

Author
Owner

@taweili commented on GitHub (Mar 10, 2024):

> Yeah, sure that works. But this is about getting Ollama to run with UMA memory and "auto" mode in the BIOS!

I managed to get Ollama and llama.cpp to run on the 5700G with export HSA_ENABLE_SDMA=0. The performance gain isn't much, but I am also looking into the hipHostMalloc hack. You can see more info here: https://github.com/ROCm/ROCm/issues/2774

Author
Owner

@robertvazan commented on GitHub (Mar 11, 2024):

> I managed to get Ollama and llama.cpp to run on the 5700G with export HSA_ENABLE_SDMA=0.

I can confirm this works with Ryzen 5600G and ROCm 6.0.

It would be ideal to have these overrides stored centrally in ROCm, llama.cpp, or Ollama code.
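
As a rough illustration only (this is not Ollama's code; the table contents and function name are made up, reusing the environment variables people in this thread already set by hand), centrally stored iGPU workarounds could be as simple as a lookup table the server applies before starting the runner:

```
// sketch: a central table of known iGPU workarounds, keyed by gfx target
package main

import "os"

var igpuEnvOverrides = map[string]map[string]string{
	// example entry: the values commenters above use on Ryzen 5600G/5700G APUs
	// (which typically report gfx90c)
	"gfx90c": {
		"HSA_ENABLE_SDMA":          "0",
		"HSA_OVERRIDE_GFX_VERSION": "9.0.0",
	},
}

// applyIGPUOverrides sets the workaround variables for the detected gfx
// target, unless the user has already set them explicitly.
func applyIGPUOverrides(gfxTarget string) {
	for name, value := range igpuEnvOverrides[gfxTarget] {
		if _, alreadySet := os.LookupEnv(name); !alreadySet {
			os.Setenv(name, value)
		}
	}
}

func main() {
	applyIGPUOverrides("gfx90c")
}
```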

Author
Owner

@DocMAX commented on GitHub (Mar 12, 2024):

Doesn't work yet: "not enough vram available, falling back to CPU only"

Author
Owner

@ddpasa commented on GitHub (Mar 13, 2024):

Vulkan can really help here: https://github.com/ollama/ollama/pull/2578

llama.cpp has some Vulkan support, but it's in very early stages. You can try the PR above to see if that helps.

Author
Owner

@DocMAX commented on GitHub (Mar 16, 2024):

time=2024-03-16T10:49:09.847Z level=INFO source=amd_linux.go:217 msg="amdgpu [0] appears to be an iGPU with 512M reported total memory, skipping"
time=2024-03-16T10:49:09.847Z level=INFO source=routes.go:1133 msg="no GPU detected"

So, any updates on this? How can GTT memory be enabled?

Author
Owner

@ericcurtin commented on GitHub (Apr 12, 2024):

Hit this also 😄

time=2024-04-12T10:30:40.018Z level=INFO source=gpu.go:340 msg="Unable to load cudart CUDA management library /tmp/ollama3271263948/runners/cuda_v11/libcudart.so.11.0: cudart init failure: 35"
time=2024-04-12T10:30:40.018Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-12T10:30:40.018Z level=DEBUG source=gpu.go:283 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /opt/rocm/lib/libnvidia-ml.so* /usr/local/lib/libnvidia-ml.so* /opt/rh/devtoolset-7/root/libnvidia-ml.so*]"
time=2024-04-12T10:30:40.019Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-04-12T10:30:40.019Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-12T10:30:40.019Z level=WARN source=amd_linux.go:53 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-04-12T10:30:40.019Z level=INFO source=amd_linux.go:88 msg="detected amdgpu versions [gfx9012]"
time=2024-04-12T10:30:40.019Z level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /tmp/ollama3271263948/rocm"
time=2024-04-12T10:30:40.019Z level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /usr/bin"
time=2024-04-12T10:30:40.020Z level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /usr/bin/rocm"
time=2024-04-12T10:30:40.020Z level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /usr/share/ollama/lib/rocm"
time=2024-04-12T10:30:40.020Z level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /opt/rocm/lib"
time=2024-04-12T10:30:40.020Z level=DEBUG source=amd_linux.go:279 msg="host rocm linked /opt/rocm/lib => /tmp/ollama3271263948/rocm"
time=2024-04-12T10:30:40.020Z level=DEBUG source=amd_linux.go:123 msg="skipping rocm gfx compatibility check with HSA_OVERRIDE_GFX_VERSION=11.0.0"
time=2024-04-12T10:30:40.020Z level=DEBUG source=amd_linux.go:152 msg="discovering VRAM for amdgpu devices"
time=2024-04-12T10:30:40.020Z level=DEBUG source=amd_linux.go:171 msg="amdgpu devices [0]"
time=2024-04-12T10:30:40.021Z level=INFO source=amd_linux.go:217 msg="amdgpu [0] appears to be an iGPU with 512M reported total memory, skipping"
time=2024-04-12T10:30:40.021Z level=INFO source=routes.go:1141 msg="no GPU detected"

on an octa-core "AMD Ryzen 7 5700U with Radeon Graphics"

Maybe I should buy myself a GPU :)

Author
Owner

@ericcurtin commented on GitHub (Apr 12, 2024):

Does anybody have experience with connecting something like:

a "Radeon RX 7600" GPU

to a PCIe slot designed for an NVMe drive? Any recommendations for an adapter?

I want to learn about AI with Ollama :)

Author
Owner

@qkiel commented on GitHub (Apr 20, 2024):

@DocMAX thanks for the tip on the compilation. It works with one additional line of code.

Let me explain the process from the start, so others can follow. I have an AMD 5600G APU and use Ubuntu 22.04.

Compiling Ollama requires newer versions of cmake and go than the ones available in Ubuntu 22.04 (see https://github.com/ollama/ollama/blob/main/docs/development.md):

  • cmake version 3.24 or higher
  • go version 1.22 or higher
  • gcc version 11.4.0 or higher
  • ROCm 6.0.3 or 6.1
  • libclblast for AMD

First, install ROCm using the official instructions (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/amdgpu-install.html#ubuntu). I'm using version 6.1 even though officially it no longer supports GCN-based iGPUs.

Next, install some required packages:

sudo apt install git ccache libclblast-dev make

And finally install cmake and go from official pages:

wget https://go.dev/dl/go1.22.3.linux-amd64.tar.gz
tar -xzf go1.22.3.linux-amd64.tar.gz

wget https://github.com/Kitware/CMake/releases/download/v3.29.3/cmake-3.29.3-linux-x86_64.tar.gz
tar -xzf cmake-3.29.3-linux-x86_64.tar.gz

We need to add the extracted directories to the PATH. Open .profile with a text editor and add this line at the end (/home/ubuntu/ depends on your user, so change it accordingly):

export PATH=$PATH:/home/ubuntu/go/bin:/home/ubuntu/cmake-3.29.3-linux-x86_64/bin

Now run source ~/.profile to make sure the environment variable is set.

Getting the Ollama source code is simple: use git clone with the tag of the latest release (https://github.com/ollama/ollama/releases):

git clone --depth 1 --branch v0.1.38 https://github.com/ollama/ollama

Let's make two changes in the source code. In the ollama/llm/generate/gen_linux.sh file, find a line that begins with if [ -d "${ROCM_PATH}" ]; then. A few lines under it, there is a line that begins with CMAKE_DEFS=.

Because Ollama uses llama.cpp under the hood, we can add the CMake definitions required for an APU there. In my case these are -DLLAMA_HIP_UMA=on -DHSA_ENABLE_SDMA=off -DHSA_OVERRIDE_GFX_VERSION=9.0.0. I also changed both -DAMDGPU_TARGETS=$(amdGPUs) and -DGPU_TARGETS=$(amdGPUs) to gfx900 (this value depends on your iGPU, of course). It should look like this:

CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=on -DHSA_ENABLE_SDMA=off -DHSA_OVERRIDE_GFX_VERSION=9.0.0 -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang -DCMAKE_CXX_COMPILER=$ROCM_PATH/llvm/bin/clang++ -DAMDGPU_TARGETS=gfx900 -DGPU_TARGETS=gfx900"

The second thing we have to change is in the ollama/gpu/amd_linux.go file. Find the line that begins with if totalMemory < IGPUMemLimit {. Just before it, add totalMemory = 16 * format.GibiByte, where the value 16 is how much VRAM (in GiB) Ollama can use for the models. I wouldn't go beyond your_RAM_in_GB - 8. The code should look like this:

		totalMemory = 16 * format.GibiByte
		if totalMemory < IGPUMemLimit {
			slog.Info...

Now Ollama thinks my iGPU has 16 GB of VRAM assigned to it and doesn't complain. Up to 16 GB will be used while Ollama is running and models are loaded, but when we stop the server, our RAM will be free again.

Compile Ollama:

cd ollama
go generate ./...
go build .

Now you can run it with this command:

~/ollama/./ollama serve
Author
Owner

@rjmalagon commented on GitHub (Apr 22, 2024):

@DocMAX thanks for the tip on the compilation. It works with one additional line of code.

Let me explain the process from the start, so others can follow. I have AMD 5600G APU and use Ubuntu 22.04.

Compiling Ollama requires newer versions of cmake and go than the ones available in Ubuntu 22.04:

* `cmake` version 3.24 or higher

* `go` version 1.22 or higher

* `gcc` version 11.4.0 or higher

* `ROCm` 6.0.3 or 6.1

* `libclblast` for AMD

First install ROCm using official instructions. I'm using version 6.1 even though officially it no longer supports GCN-based iGPUs.

Next, install some required packages:

sudo apt install git ccache libclblast-dev make

And finally install cmake and go from official pages:

wget https://go.dev/dl/go1.22.2.linux-amd64.tar.gz
tar -xzf go1.22.2.linux-amd64.tar.gz

wget https://github.com/Kitware/CMake/releases/download/v3.29.2/cmake-3.29.2-linux-x86_64.tar.gz
tar -xzf cmake-3.29.2-linux-x86_64.tar.gz

We need to add extracted directories to the PATH. Open the .profile with text editor and add this line at the end (/home/ubuntu/ depends on your user, so change it accordingly):

export PATH=$PATH:/home/ubuntu/go/bin:/home/ubuntu/cmake-3.29.2-linux-x86_64/bin

Now use source ~/.profile command to make sure environment variable is set.

We can download the latest source code for Ollama and extract it. In that file, the .git folder is missing, so we need to clone the git repository and copy it to our project:

wget https://github.com/ollama/ollama/archive/refs/tags/v0.1.32.tar.gz
tar -xzf v0.1.32.tar.gz

git clone https://github.com/ollama/ollama
cp -r ollama/.git ollama-0.1.32/

Let's make two changes in the source code. In the ollama-0.1.32/llm/generate/gen_linux.sh file, find a line that begins with if [ -d "${ROCM_PATH}" ]; then. A few lines under it, there is a line that begins with CMAKE_DEFS=.

Because Ollama uses llama.cpp under the hood, we can add there environment variables required for APU. In my case these are -DLLAMA_HIP_UMA=on -DHSA_ENABLE_SDMA=off -DHSA_OVERRIDE_GFX_VERSION=9.0.0. I also changed both -DAMDGPU_TARGETS=$(amdGPUs) and -DGPU_TARGETS=$(amdGPUs) to gfx900 (this value depends on your iGPU of course). It should look like this:

CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=on -DHSA_ENABLE_SDMA=off -DHSA_OVERRIDE_GFX_VERSION=9.0.0 -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang -DCMAKE_CXX_COMPILER=$ROCM_PATH/llvm/bin/clang++ -DAMDGPU_TARGETS=gfx900 -DGPU_TARGETS=gfx900"

In the second file ollama-0.1.32/gpu/amd_linux.go find a line that begins with if totalMemory < IGPUMemLimit {. Just before it, add totalMemory = totalMemory * 24. Change number 24 to your liking. My APU when set to UMA_AUTO in UEFI/BIOS reports 512 MB of VRAM. Half GB multiplied by 24 equals 12 GB, so now Ollama thinks my iGPU has 12 GB of VRAM assigned to it and doesn't complain. This code should look like this:

		totalMemory = totalMemory * 24
		if totalMemory < IGPUMemLimit {
			slog.Info(fmt.Sprintf("amdgpu [%d] appears to be an iGPU with %dM reported total memory, skipping", id, totalMemory/1024/1024))
			skip[id] = struct{}{}
			continue
		}

Compile Ollama:

cd ollama-0.1.32
go generate ./...
go build .

Now you can run it with this command:

~/ollama-0.1.32/./ollama serve

Yup, this trick works pretty well if you know what you are doing.
Command R was usable on the GPU (>32GB of RAM just to load), and I can raise the context window to 32000 tokens (>60GB of RAM), around 80G without major hiccups. Ryzen 5600G here with 128G of RAM. I modified the source and variables in the Dockerfile for a very AMD-APU-friendly ollama rocm container.

Author
Owner

@DimitriosKakouris commented on GitHub (May 7, 2024):

I run ollama in Docker with the official ollama:rocm image, started with docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 -e HSA_OVERRIDE_GFX_VERSION="11.0.0" --name ollama ollama/ollama:rocm. That is the same HSA override that @kirel used, and I also have the AMD 7840HS with Radeon 780M (gfx1103), but when I type an instruction, instead of an answer I get a blank (white) screen with only the mouse cursor working. It does not go away unless I hard-restart.

Author
Owner

@kirel commented on GitHub (May 7, 2024):

@ntua-el19019 did you set your VRAM to a high enough value in the BIOS? I set UM_SPECIFIED to 16 Gigs. And here is my full docker-compose.yml

version: '3.8'
services:
  ollama-service:
    image: ollama/ollama:rocm
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    devices:
      - "/dev/dri:/dev/dri"
      - "/dev/kfd:/dev/kfd"
    environment:
      - "HSA_OVERRIDE_GFX_VERSION=11.0.0"
      - "OLLAMA_DEBUG=1"
      - "OLLAMA_MAX_VRAM=16106127360"
      - "OLLAMA_NUM_PARALLEL=2"
      - "OLLAMA_MAX_LOADED_MODELS=2"
    group_add:
      - video
    stdin_open: true
    tty: true
    volumes:
      - ollama_data:/root/.ollama
    ipc: host
    privileged: true
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
Author
Owner

@DimitriosKakouris commented on GitHub (May 7, 2024):

@ntua-el19019 did you set your VRAM to a high enough value in the BIOS? I set UM_SPECIFIED to 16 Gigs. And here is my full docker-compose.yml

version: '3.8'
services:
  ollama-service:
    image: ollama/ollama:rocm
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    devices:
      - "/dev/dri:/dev/dri"
      - "/dev/kfd:/dev/kfd"
    environment:
      - "HSA_OVERRIDE_GFX_VERSION=11.0.0"
      - "OLLAMA_DEBUG=1"
      - "OLLAMA_MAX_VRAM=16106127360"
      - "OLLAMA_NUM_PARALLEL=2"
      - "OLLAMA_MAX_LOADED_MODELS=2"
    group_add:
      - video
    stdin_open: true
    tty: true
    volumes:
      - ollama_data:/root/.ollama
    ipc: host
    privileged: true
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined

Let me try to increase the VRAM.

Author
Owner

@DimitriosKakouris commented on GitHub (May 7, 2024):

@ntua-el19019 did you set your VRAM to a high enough value in the BIOS? I set UM_SPECIFIED to 16 Gigs. And here is my full docker-compose.yml

version: '3.8'
services:
  ollama-service:
    image: ollama/ollama:rocm
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    devices:
      - "/dev/dri:/dev/dri"
      - "/dev/kfd:/dev/kfd"
    environment:
      - "HSA_OVERRIDE_GFX_VERSION=11.0.0"
      - "OLLAMA_DEBUG=1"
      - "OLLAMA_MAX_VRAM=16106127360"
      - "OLLAMA_NUM_PARALLEL=2"
      - "OLLAMA_MAX_LOADED_MODELS=2"
    group_add:
      - video
    stdin_open: true
    tty: true
    volumes:
      - ollama_data:/root/.ollama
    ipc: host
    privileged: true
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined

Damn, it worked! Also, adding the boot parameter amdgpu.sg_display=0 removed the blank screen!
Thank you.

Author
Owner

@kirel commented on GitHub (May 7, 2024):

> Yup, this trick works pretty well if you know what you are doing. Command R was usable on the GPU (>32GB of RAM just to load), and I can raise the context window to 32000 tokens (>60GB of RAM), around 80G without major hiccups. Ryzen 5600G here with 128G of RAM. I modified the source and variables in the Dockerfile for a very AMD-APU-friendly ollama rocm container.

@rjmalagon do you have your updated Dockerfile somewhere?

Author
Owner

@DocMAX commented on GitHub (May 7, 2024):

Haven't looked into it in a while. Seems some of the log text changed, but it still doesn't work if GPU memory is set to "auto" in the BIOS.

time=2024-05-07T23:36:49.728Z level=INFO source=gpu.go:96 msg="Detecting GPUs"

time=2024-05-07T23:36:50.315Z level=DEBUG source=gpu.go:203 msg="Searching for GPU library" name=libcudart.so*

time=2024-05-07T23:36:50.316Z level=DEBUG source=gpu.go:221 msg="gpu library search" globs="[/tmp/ollama1994223302/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /opt/rocm/lib/libcudart.so** /usr/local/lib/libcudart.so** /opt/rh/devtoolset-7/root/libcudart.so**]"

time=2024-05-07T23:36:50.400Z level=DEBUG source=gpu.go:249 msg="discovered GPU libraries" paths=[/tmp/ollama1994223302/runners/cuda_v11/libcudart.so.11.0]

cudaSetDevice err: 35

time=2024-05-07T23:36:50.437Z level=DEBUG source=gpu.go:261 msg="Unable to load cudart" library=/tmp/ollama1994223302/runners/cuda_v11/libcudart.so.11.0 error="your nvidia driver is too old or missing.  If you have a CUDA GPU please upgrade to run ollama"

time=2024-05-07T23:36:50.437Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

time=2024-05-07T23:36:50.437Z level=WARN source=amd_linux.go:49 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"

time=2024-05-07T23:36:50.644Z level=DEBUG source=amd_linux.go:78 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/0/properties"

time=2024-05-07T23:36:50.649Z level=DEBUG source=amd_linux.go:102 msg="detected CPU /sys/class/kfd/kfd/topology/nodes/0/properties"

time=2024-05-07T23:36:50.649Z level=DEBUG source=amd_linux.go:78 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/1/properties"

time=2024-05-07T23:36:50.661Z level=INFO source=amd_linux.go:213 msg="amdgpu appears to be an iGPU, skipping" gpu=0 total="512.0 MiB"

time=2024-05-07T23:36:50.661Z level=INFO source=amd_linux.go:286 msg="no compatible amdgpu devices detected"

Anything I can do?

Author
Owner

@qkiel commented on GitHub (May 8, 2024):

> time=2024-05-07T23:36:50.661Z level=INFO source=amd_linux.go:213 msg="amdgpu appears to be an iGPU, skipping" gpu=0 total="512.0 MiB"

Ollama thinks you have too little VRAM available, even though llama.cpp can support UMA and use your RAM. The workaround is to compile ollama with two small changes in the source code. The solution is just a few posts above in this thread:
https://github.com/ollama/ollama/issues/2637#issuecomment-2067766641

Author
Owner

@rjmalagon commented on GitHub (May 9, 2024):

> > Yup, this trick works pretty well if you know what you are doing. Command R was usable on the GPU (>32GB of RAM just to load), and I can raise the context window to 32000 tokens (>60GB of RAM), around 80G without major hiccups. Ryzen 5600G here with 128G of RAM. I modified the source and variables in the Dockerfile for a very AMD-APU-friendly ollama rocm container.

> @rjmalagon do you have your updated Dockerfile somewhere?

If you don't mind too much, you can use the current Dockerfile and modify gpu/amd_linux.go

                totalMemory = totalMemory * 15
                if totalMemory < IGPUMemLimit { 

In this example, I multiply my current 8GB of iGPU VRAM by 15 to match my 120GB of usable physical RAM (128GB - 8GB).
Also modify llm/generate/gen_linux.sh to add "-DLLAMA_HIP_UMA=on" to the ROCm cmake defs.
Then just use docker build with --build-arg=AMDGPU_TARGETS="gfx900" (replace the target with your AMD APU's).

I can share a lightly modified ollama docker image with you if you just want to test it.

Author
Owner

@kirel commented on GitHub (May 9, 2024):

There is actually another undocumented env var OLLAMA_CUSTOM_ROCM_DEFS (at least in main). I was able to compile with

sudo docker build --target runtime-rocm --build-arg OLLAMA_CUSTOM_ROCM_DEFS="-DLLAMA_HIP_UMA=on -DHSA_OVERRIDE_GFX_VERSION=11.0.0" --build-arg AMDGPU_TARGETS="gfx1100" . -t ollama:apu

Adapt to your APU, obviously!
And then my only source code modification is totalMemory = ...

Author
Owner

@rjmalagon commented on GitHub (May 9, 2024):

@kirel Yup, your way is cleaner than my hackish way. I have a highly modified environment for special purposes, and I quickly adapted a succinct way to replicate the essentials of AMD APU support for this thread; your post is way easier for others to follow.

Author
Owner

@kirel commented on GitHub (May 9, 2024):

I've been trying to understand how, at least on Linux, ollama determines the memory available. In my case, reading the line that starts with size_in_bytes in /sys/class/kfd/kfd/topology/nodes/0/mem_banks/0/properties gives me the CPU memory, and from https://www.kernel.org/doc/html/v4.19/gpu/amdgpu.html I gather that 3/4 of that is what can be allocated. So I changed the line to

totalMemory = (66066927616 * 3) / 4

for me for now, and with 64GB of RAM I can now load llama3:70b fully into the "GPU". It's slow, but it works.
So a proper solution should detect that we have an iGPU and collect totalMemory automatically, roughly like I did manually here?
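
A minimal sketch of that idea, assuming the sysfs layout described above (the helper name is made up and this is not Ollama code): read the size_in_bytes line from the kfd node's memory bank and treat roughly 3/4 of it as allocatable, following the interpretation above:

```
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// igpuUsableMemory parses size_in_bytes from a kfd mem_banks properties file
// and returns 3/4 of it, the portion assumed to be allocatable as GTT.
func igpuUsableMemory(propsPath string) (uint64, error) {
	f, err := os.Open(propsPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "size_in_bytes" {
			total, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return total * 3 / 4, nil
		}
	}
	return 0, fmt.Errorf("size_in_bytes not found in %s", propsPath)
}

func main() {
	mem, err := igpuUsableMemory("/sys/class/kfd/kfd/topology/nodes/0/mem_banks/0/properties")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("usable iGPU memory: %d bytes\n", mem)
}
```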

Author
Owner

@santo998 commented on GitHub (May 16, 2024):

I have a Ryzen 5 3400G APU and configured VRAM = 8 GB in the BIOS.

However, ollama is only using the CPU.

I'm using ollama for Windows (version 0.1.38, without Docker or anything else), running "ollama run phi3" from the command line.

Here you can see my task manager and "ollama ps" command output:

![image](https://github.com/ollama/ollama/assets/15607273/8f18bcb3-3e28-4c33-8611-b5dcc004884b)

Author
Owner

@rjmalagon commented on GitHub (May 18, 2024):

@santo998 common Ollama binaries will not work for you. Your GPU is old (older than mine) and unsupported by default (just like mine). We did a custom Ollama build with some unofficial changes to the GPU memory count and forced old-GPU and main-RAM support within ROCm.

I am not familiar with Windows building, but I may be able to help you with the necessary changes to the Ollama source and build scripts.

Author
Owner

@santo998 commented on GitHub (May 18, 2024):

@rjmalagon why aren't those changes included in the official ollama source code?

Or at least available as a fork...

I can try it on Ubuntu too. What changes should I make?

Author
Owner

@rjmalagon commented on GitHub (May 18, 2024):

I can only guess, but I strongly suspect it's because ROCm support for GPU compute on old AMD iGPUs is incomplete and lacking.
Even a newer and stronger iGPU, like the one on the Ryzen 5600G, is not that much faster (~10% to ~15%) than pure CPU (6 cores / 12 threads), and I doubt the Ryzen 3400G's GPU will give you significant performance. Main RAM speed is a bottleneck by itself too. Ollama on a modest AMD iGPU is useful for very specific local use with small models, because you get free CPU to spare while inferencing. With big models (70b+) you get nothing useful from a modest AMD iGPU.

Maybe newer and future AMD iGPUs will offer notable performance, and that will be enough for someone to code Ollama build routines for them.

To enable Ollama for an AMD iGPU you will need these three things:

  • Enable using main memory for the iGPU with "-DLLAMA_HIP_UMA=on"
  • Trick ollama about how much VRAM you have, because you will use main RAM and VRAM measurements are meaningless here
  • Force the ollama build for your iGPU

Read this post https://github.com/ollama/ollama/issues/2637#issuecomment-2067766641 and this one https://github.com/ollama/ollama/issues/2637#issuecomment-2102113224, which are most relevant to your situation.

Author
Owner

@arilou commented on GitHub (May 19, 2024):

I have just run into this thread. I have been playing with llama.cpp
and ollama as well. Here is my setup:

  • 5700G
  • Fedora 40

I added the following patch to ollama:

diff --git a/gpu/amd_linux.go b/gpu/amd_linux.go
index 6b08ac2..579186b 100644
--- a/gpu/amd_linux.go
+++ b/gpu/amd_linux.go
@@ -229,6 +229,15 @@ func AMDGetGPUInfo() []GpuInfo {
 		}

 		// iGPU detection, remove this check once we can support an iGPU variant of the rocm library
+		if override, exists := os.LookupEnv("OLLAMA_VRAM_OVERRIDE"); exists {
+			// Convert the environment variable to an integer
+			if value, err := strconv.ParseUint(override, 10, 64); err == nil {
+				totalMemory = value
+			} else {
+				fmt.Println("Error parsing OLLAMA_VRAM_OVERRIDE:", err)
+			}
+		}
+
 		if totalMemory < IGPUMemLimit {
 			slog.Info("unsupported Radeon iGPU detected skipping", "id", gpuID, "total", format.HumanBytes2(totalMemory))
 			continue

Then I built using Docker:

$ sudo docker build --target runtime-rocm --build-arg OLLAMA_CUSTOM_ROCM_DEFS="-DLLAMA_HIP_UMA=on -DHSA_ENABLE_SDMA=off -DHSA_OVERRIDE_GFX_VERSION=9.0.0" --build-arg AMDGPU_TARGETS="gfx900" . -t ollama:apu

Then I extracted the binary from the docker image using docker export

In addition, I downloaded the ROCm parts:
https://github.com/ollama/ollama/releases/download/v0.1.38/ollama-linux-amd64-rocm.tgz

Next, I took https://github.com/segurac/force-host-alloction-APU and built it:

$ CUDA_PATH=/usr/ HIP_PLATFORM="amd" hipcc forcegttalloc.c -o libforcegttalloc.so  -shared -fPIC

And copy the output libforcegttalloc.so to the same folder

So now you should have a folder with:

.rwxr-xr-x@   140,718,936 jond jond 12 May 01:47 libamd_comgr.so.2
.rwxr-xr-x@    23,520,736 jond jond 12 May 01:47 libamdhip64.so.6
.rwxr-xr-x@       102,240 jond jond 12 May 01:47 libdrm.so.2
.rwxr-xr-x@        57,464 jond jond 12 May 01:47 libdrm_amdgpu.so.1
.rwxr-xr-x@        18,976 jond jond 16 May 20:25 libforcegttalloc.so
.rwxr-xr-x@       866,816 jond jond 12 May 01:47 libhipblas.so.2
.rwxr-xr-x@     2,879,088 jond jond 12 May 01:47 libhsa-runtime64.so.1
.rwxr-xr-x@   846,023,792 jond jond 12 May 01:47 librocblas.so.4
.rwxr-xr-x@ 1,579,606,760 jond jond 12 May 01:47 librocsolver.so.0
.rwxr-xr-x@ 1,245,373,184 jond jond 12 May 01:47 librocsparse.so.1
.rwxr-xr-x@       174,576 jond jond 12 May 01:47 libtinfo.so.5
.rwxr-xr-x@   308,235,672 jond jond 16 May 06:11 ollama
drwxr-xr-x@             - jond jond 12 May 01:47 rocblas

Now to start ollama I would use the script below (change the VRAM according to your system memory). You don't need to take away a lot of memory from your system
RAM for VRAM anymore, as it will simply use the system RAM; it should not affect the speed since the iGPU sits on the same SoC as the CPU...

#! /bin/bash
module load rocm/gfx9
LD_PRELOAD=./libforcegttalloc.so OLLAMA_VRAM_OVERRIDE=$((1024*1024*1024*55)) OLLAMA_MODELS=/data/ollama/models/ ROCBLAS_TENSILE_LIBPATH=/home/jond/llama/rocblas/library HSA_XNACK=0 HSA_ENABLE_SDMA=0 HSA_OVERRIDE_GFX_VERSION=9.0.0 ./ollama serve

I have also tried playing with XNACK, which as far as I understand should make
memory access faster as pages "migrate" to the VRAM (with XNACK I think
libforcegttalloc won't be required).

But I could not get it to work properly, as there seems to be a bug in the amdgpu driver. If you want to play with it, you need to add this to your
kernel params:

amdgpu.noretry=0

And then, to execute ollama, change HSA_XNACK to 1.
(To validate you have XNACK you can execute HSA_XNACK=1 rocminfo | grep xnack; you will see :xnack+ instead of :xnack-.)

I opened a ticket for the AMD driver in the hope of getting some feedback from them:
https://gitlab.freedesktop.org/drm/amd/-/issues/3386

If you want to "force" enabling XNACK you can change in the docker build from AMDGPU_TARGETS="gfx900" to AMDGPU_TARGETS="gfx900:xnack+"

I'll be happy to hear your results. Sorry I did not do the extra step to make it all run from Docker (i.e. passing the env vars and libforcegttalloc.so into
the container); I ran out of time to play with it more.

Author
Owner

@qkiel commented on GitHub (May 19, 2024):

This is a great idea, we can now change totalMemory on the fly.

I would only make OLLAMA_VRAM_OVERRIDE more human-readable. The value of totalMemory is compared to IGPUMemLimit from the ollama/gpu/gpu.go file, which is defined like this:

const IGPUMemLimit = 1 * format.GibiByte

Similarly, in your code you can calculate totalMemory like this:

totalMemory = value * format.GibiByte

And set env OLLAMA_VRAM_OVERRIDE=55.
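For illustration, a launch with the patched build might then look like this (a sketch assuming the patch is adjusted as above so the value is read in GiB; the other variables are the ones already used earlier in this thread):

```
# Sketch: patched build, OLLAMA_VRAM_OVERRIDE read in GiB, env names as used earlier in this thread.
LD_PRELOAD=./libforcegttalloc.so \
HSA_OVERRIDE_GFX_VERSION=9.0.0 HSA_ENABLE_SDMA=0 \
OLLAMA_VRAM_OVERRIDE=55 ./ollama serve
```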

Question about the .git folder

When I download the source for an ollama release, for example 0.1.38, there's no .git folder in the archive. I usually get it from the master branch:

git clone https://github.com/ollama/ollama
cp -r ollama/.git ollama-0.1.38/

But that didn't work for ollama 0.1.38:

error: patch failed: examples/llava/clip.cpp:573
error: examples/llava/clip.cpp: patch does not apply

Is there a way to re-create a proper .git folder? Simple git init in the ollama-0.1.38 folder doesn't work.

<!-- gh-comment-id:2119225266 --> @qkiel commented on GitHub (May 19, 2024): This is a great idea, we can now change `totalMemory` on the fly. I would only make `OLLAMA_VRAM_OVERRIDE` more human-readable. Value of `totalMemory` is compared to `IGPUMemLimit` from `ollama/gpu/gpu.go` file, which is defined like this: ``` const IGPUMemLimit = 1 * format.GibiByte ``` Similarly, in your code you can calculate `totalMemory` like this: ``` totalMemory = value * format.GibiByte ``` And set env `OLLAMA_VRAM_OVERRIDE=55`. ### Question about the *.git* folder When I download source for *ollama* release, for example 0.1.38, there's no `.git` folder in the archive. I usually get it from master branch: ``` git clone https://github.com/ollama/ollama cp -r ollama/.git ollama-0.1.38/ ``` But that didn't work for *ollama 0.1.38*: ``` error: patch failed: examples/llava/clip.cpp:573 error: examples/llava/clip.cpp: patch does not apply ``` Is there a way to re-create a proper `.git` folder? Simple `git init` in the `ollama-0.1.38` folder doesn't work.
Author
Owner

@arilou commented on GitHub (May 20, 2024):

It's simpler to just clone the git repository and check out whatever tag / branch you want to build.

<!-- gh-comment-id:2119606715 --> @arilou commented on GitHub (May 20, 2024): It's simpler to just clone the git repository and checkout to whatever tag / branch you want to build
Author
Owner

@qkiel commented on GitHub (May 20, 2024):

It's simpler to just clone the git repository and checkout to whatever tag / branch you want to build

You're right, you can clone the repo at a particular tag:

git clone --depth 1 --branch v0.1.38 https://github.com/ollama/ollama

Thanks for the tip.

<!-- gh-comment-id:2119849462 --> @qkiel commented on GitHub (May 20, 2024): > It's simpler to just clone the git repository and checkout to whatever tag / branch you want to build You're right, you can clone repo with a particular tag: ``` git clone --depth 1 --branch v0.1.38 https://github.com/ollama/ollama ``` Thanks for the tip.
Author
Owner

@DocMAX commented on GitHub (May 20, 2024):

Meanwhile, is it possible to fix this? (leaving VRAM set to "auto" in the BIOS)
time=2024-05-20T10:04:10.752Z level=INFO source=amd_linux.go:233 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB"

<!-- gh-comment-id:2120121015 --> @DocMAX commented on GitHub (May 20, 2024): Meanwhile is it possible to fix this? (Leaving VRAM to "auto" in BIOS) `time=2024-05-20T10:04:10.752Z level=INFO source=amd_linux.go:233 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB"`
Author
Owner

@arilou commented on GitHub (May 20, 2024):

Not sure I'm following. If you use the VRAM override env and also libforcegttalloc.so to expose your entire system memory, it won't show this message.
The code only checks that the reported total is at least 1 GiB... pretty sure your system memory is more than 1 GiB.

<!-- gh-comment-id:2120127293 --> @arilou commented on GitHub (May 20, 2024): Not sure I'm following if you use the VRAM override env and also libforcegttalloc.so to expose your entire system memory it wont show this print. The code verifies you have at least 1Gb of ram... pretty sure your system memory has more than 1Gb
Author
Owner

@DocMAX commented on GitHub (May 20, 2024):

Yes, i have 64GB RAM. Full output:

time=2024-05-20T10:04:10.752Z level=INFO source=amd_linux.go:233 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB"
time=2024-05-20T10:04:10.752Z level=INFO source=amd_linux.go:311 msg="no compatible amdgpu devices detected"
time=2024-05-20T10:04:10.752Z level=WARN source=amd_linux.go:48 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-05-20T10:04:08.192Z level=INFO source=images.go:704 msg="total blobs: 41"
time=2024-05-20T10:04:08.193Z level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-05-20T10:04:08.193Z level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"
time=2024-05-20T10:04:08.194Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2547128236/runners
time=2024-05-20T10:04:10.747Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]"
time=2024-05-20T10:04:10.752Z level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.2 GiB" available="11.3 GiB"
<!-- gh-comment-id:2120146151 --> @DocMAX commented on GitHub (May 20, 2024): Yes, i have 64GB RAM. Full output: ``` time=2024-05-20T10:04:10.752Z level=INFO source=amd_linux.go:233 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB" time=2024-05-20T10:04:10.752Z level=INFO source=amd_linux.go:311 msg="no compatible amdgpu devices detected" time=2024-05-20T10:04:10.752Z level=WARN source=amd_linux.go:48 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory" time=2024-05-20T10:04:08.192Z level=INFO source=images.go:704 msg="total blobs: 41" time=2024-05-20T10:04:08.193Z level=INFO source=images.go:711 msg="total unused blobs removed: 0" time=2024-05-20T10:04:08.193Z level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)" time=2024-05-20T10:04:08.194Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2547128236/runners time=2024-05-20T10:04:10.747Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]" time=2024-05-20T10:04:10.752Z level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.2 GiB" available="11.3 GiB" ```
Author
Owner

@arilou commented on GitHub (May 20, 2024):

It seems like you either ran without OLLAMA_VRAM_OVERRIDE=$((1024*1024*1024*55)) or you did not rebuild with my patch diff..

<!-- gh-comment-id:2120148812 --> @arilou commented on GitHub (May 20, 2024): It seems like you either ran without OLLAMA_VRAM_OVERRIDE=$((1024*1024*1024*55)) or you did not rebuild with my patch diff..
Author
Owner

@DocMAX commented on GitHub (May 20, 2024):

I'm using docker image ollama/ollama:rocm

<!-- gh-comment-id:2120150963 --> @DocMAX commented on GitHub (May 20, 2024): I'm using docker image ollama/ollama:rocm
Author
Owner

@arilou commented on GitHub (May 20, 2024):

What I suggested is a change to ollama; OLLAMA_VRAM_OVERRIDE is not part of ollama today...

<!-- gh-comment-id:2120153673 --> @arilou commented on GitHub (May 20, 2024): What I suggested is a change over ollama, OLLAMA_VRAM_OVERRIDE, this is not part of ollama today...
Author
Owner

@DocMAX commented on GitHub (May 20, 2024):

OK, then I have to wait for the Docker version, because I want to stay on Docker.

<!-- gh-comment-id:2120155991 --> @DocMAX commented on GitHub (May 20, 2024): OK, then i have to wait for the docker version, because i want stay on docker.
Author
Owner

@qkiel commented on GitHub (May 20, 2024):

Not sure I'm following if you use the VRAM override env and also libforcegttalloc.so to expose your entire system memory it wont show this print. The code verifies you have at least 1Gb of ram... pretty sure your system memory has more than 1Gb

Curious question - why do you use libforcegttalloc.so with ollama? Isn't it only intended for use with applications that require PyTorch? Without LD_PRELOAD everything should work exactly the same.

<!-- gh-comment-id:2120681550 --> @qkiel commented on GitHub (May 20, 2024): > Not sure I'm following if you use the VRAM override env and also libforcegttalloc.so to expose your entire system memory it wont show this print. The code verifies you have at least 1Gb of ram... pretty sure your system memory has more than 1Gb Curious question - why do you use *libforcegttalloc.so* with *ollama*? Isn't it only intended for use with applications that require *PyTorch*? Without *LD_PRELOAD* everything should work exactly the same.
Author
Owner

@arilou commented on GitHub (May 20, 2024):

Well, the reason is that if you look at what happens when you compile ollama/llama.cpp (even with LLAMA_HIP_UMA=on),
it still charges the allocation against VRAM (you can watch this with radeontop/amdgpu_top).
That limits you to the amount of VRAM you can assign in the BIOS (the max is 16 GB).
But say my system has 64 GB of memory; there is no reason I shouldn't be able to load much larger models, e.g. 50 GB.
After all, there is no "real" meaning to those 16 GB being VRAM; they sit on the same DIMMs. So with the trick done
in libforcegttalloc.so you basically charge the memory to the OS instead, and since we are all here with APUs it is the same physical memory;
we just need to go through HIP so that the iGPU understands it can access those pointers directly.

So with this trick you can now load much bigger models and "steal" less memory from your system for your GPU.

So for example I loaded llama3:70b-instruct-q4_K_M, which is about 40 GB, and I still get 0.8 tps, which is fairly OK for the power of our iGPU...
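A quick way to confirm where the allocation actually lands is to watch the amdgpu memory counters while a model loads (a sketch; the card index may be card1 on some systems):

```
# Bytes currently used; GTT growing while VRAM stays flat means the
# allocation went to GTT instead of the BIOS carveout.
cat /sys/class/drm/card0/device/mem_info_vram_used
cat /sys/class/drm/card0/device/mem_info_gtt_used
```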

<!-- gh-comment-id:2120748593 --> @arilou commented on GitHub (May 20, 2024): Well the reason is that if you will look when you compile ollama/llama.cpp (even with `LLAMA_HIP_UMA=on`) It will charge the VRAM memory (you can use radeontop/amdgpu_top) That limits you to amount of VRAM you can assign in the BIOS (max is 16Gb) But let's say on my system I have 64Gb of memory, there is no reason I wont be able to load much larger models like 50Gb After all there is no "real" meaning to those 16Gb being VRam, they sit on the say DIMM... so by using the trick done in _libforcegttalloc.so_ you basically charge memory from the OS but since we are all here with APUs it's the same thing we just need to go through hips in order for the iGFX to understand it can access those pointers regularly So with this trick you can now load models much bigger, and "steal" less memory from your system for your GPU. So for example I loaded llama3:70b-instruct-q4_K_M which is about 40Gb, and I still get 0.8tps which is fairly ok for the power of our iGPU...
Author
Owner

@qkiel commented on GitHub (May 20, 2024):

Well the reason is that if you will look when you compile ollama/llama.cpp (even with LLAMA_HIP_UMA=on) It will charge the VRAM memory (you can use radeontop/amdgpu_top) That limits you to amount of VRAM you can assign in the BIOS (max is 16Gb) But let's say on my system I have 64Gb of memory, there is no reason I wont be able to load much larger models like 50Gb After all there is no "real" meaning to those 16Gb being VRam, they sit on the say DIMM... so by using the trick done in libforcegttalloc.so you basically charge memory from the OS but since we are all here with APUs it's the same thing we just need to go through hips in order for the iGFX to understand it can access those pointers regularly

So with this trick you can now load models much bigger, and "steal" less memory from your system for your GPU.

So for example I loaded llama3:70b-instruct-q4_K_M which is about 40Gb, and I still get 0.8tps which is fairly ok for the power of our iGPU...

Interesting. I have an AMD 5600G APU with UMA_AUTO set in UEFI/BIOS (which means 512 MB is taken from my RAM for VRAM). On my Ubuntu 22.04 the libforcegttalloc.so is required only for Stable Diffusion apps like Fooocus.

Running ollama with or without LD_PRELOAD makes no difference in my case. VRAM is kept at 512 MB, models are loaded to RAM, and the compute is done on GPU.

Have you tried running ollama without LD_PRELOAD?

<!-- gh-comment-id:2120825353 --> @qkiel commented on GitHub (May 20, 2024): > Well the reason is that if you will look when you compile ollama/llama.cpp (even with `LLAMA_HIP_UMA=on`) It will charge the VRAM memory (you can use radeontop/amdgpu_top) That limits you to amount of VRAM you can assign in the BIOS (max is 16Gb) But let's say on my system I have 64Gb of memory, there is no reason I wont be able to load much larger models like 50Gb After all there is no "real" meaning to those 16Gb being VRam, they sit on the say DIMM... so by using the trick done in _libforcegttalloc.so_ you basically charge memory from the OS but since we are all here with APUs it's the same thing we just need to go through hips in order for the iGFX to understand it can access those pointers regularly > > So with this trick you can now load models much bigger, and "steal" less memory from your system for your GPU. > > So for example I loaded llama3:70b-instruct-q4_K_M which is about 40Gb, and I still get 0.8tps which is fairly ok for the power of our iGPU... Interesting. I have an AMD 5600G APU with *UMA_AUTO* set in UEFI/BIOS (which means 512 MB is taken from my RAM for VRAM). On my *Ubuntu 22.04* the *libforcegttalloc.so* is required only for Stable Diffusion apps like Fooocus. Running *ollama* with or without *LD_PRELOAD* makes no difference in my case. VRAM is kept at 512 MB, models are loaded to RAM, and the compute is done on GPU. Have you tried running *ollama* without *LD_PRELOAD*?
Author
Owner

@arilou commented on GitHub (May 20, 2024):

Interesting. For me it crashes if I try to load a model bigger than the allocated VRAM; I wonder if it's an issue because on Fedora 40 the default ROCm is 6.0.

<!-- gh-comment-id:2120834517 --> @arilou commented on GitHub (May 20, 2024): Interesting for me it crashes if I tried to module bigger than the allocated VRAM, I wonder if it's an issue because in Fedora 40, the default ROCm is 6.0.
Author
Owner

@Jonnybravo commented on GitHub (May 20, 2024):

Hey, I also have a 5600G and I wanted to make use of it. I read the whole thread, but I'm confused about which steps I should follow to change the build in order to make this work with this iGPU.

Is there a version already pre-compiled that everyone can use? I tried to follow @qkiel's steps, but it fails miserably when I try to compile and build using go...

<!-- gh-comment-id:2121039155 --> @Jonnybravo commented on GitHub (May 20, 2024): Hey, I also have a 5600G and I wanted to make use of it. I read the whole thread, but I'm confused about which steps should I do to change the build in order to make this work with this iGPU. Is there a version already pre-compiled that everyone can use? I tried to follow @qkiel steps, but this fails miserably when I try to compile and build using go...
Author
Owner

@qkiel commented on GitHub (May 20, 2024):

Hey, I also have a 5600G and I wanted to make use of it. I read the whole thread, but I'm confused about which steps should I do to change the build in order to make this work with this iGPU.

Is there a version already pre-compiled that everyone can use? I tried to follow @qkiel steps, but this fails miserably when I try to compile and build using go...

When you download source code and compile it with commands below, do you still get an error?

git clone --depth 1 --branch v0.1.38 https://github.com/ollama/ollama
cd ollama
go generate ./...
go build .
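If go generate complains about the Go toolchain, it's worth checking the Go version first; the minimum version the source expects is recorded in go.mod:

```
go version           # toolchain on PATH
grep '^go ' go.mod   # minimum Go version the ollama source expects
```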
<!-- gh-comment-id:2121218247 --> @qkiel commented on GitHub (May 20, 2024): > Hey, I also have a 5600G and I wanted to make use of it. I read the whole thread, but I'm confused about which steps should I do to change the build in order to make this work with this iGPU. > > Is there a version already pre-compiled that everyone can use? I tried to follow @qkiel steps, but this fails miserably when I try to compile and build using go... When you download source code and compile it with commands below, do you still get an error? ``` git clone --depth 1 --branch v0.1.38 https://github.com/ollama/ollama cd ollama go generate ./... go build . ```
Author
Owner

@Jonnybravo commented on GitHub (May 20, 2024):

Last time I ran the generate command I got this:

image

I didn't try anything else after this. Before I got here, the generate command would give me an error related to the wrong version of go being installed.

<!-- gh-comment-id:2121331362 --> @Jonnybravo commented on GitHub (May 20, 2024): Last time I ran the generate command I got this: ![image](https://github.com/ollama/ollama/assets/1081108/1a11e26b-62ee-45ab-812b-b607499c7019) I didn't try anything else after this. Before I got here, the generate command would give me an error related to the wrong version of go being installed.
Author
Owner

@qkiel commented on GitHub (May 21, 2024):

I updated my instruction a bit, see if it works this time. If not, I can send you my binary.

<!-- gh-comment-id:2121926286 --> @qkiel commented on GitHub (May 21, 2024): I updated my [instruction](https://github.com/ollama/ollama/issues/2637#issuecomment-2067766641) a bit, see if it works this time. If not, I can send you my binary.
Author
Owner

@Jonnybravo commented on GitHub (May 21, 2024):

I followed everything again and made sure about the versions of the requirements. This time I managed to pass the generate step, but it seems I have a problem with the GOROOT path when I try to run the build command:

image

Could it be because of this being installed in a custom folder?

EDIT: Meanwhile I tried to point both variables to different paths, but now I have an error on GOPROXY. @qkiel, did you also experience this? What are your paths for each of the variables GOROOT, GOPATH and GOPROXY?

<!-- gh-comment-id:2122359368 --> @Jonnybravo commented on GitHub (May 21, 2024): I followed everything again and made sure about the versions of the requirements. This time I managed to pass the generate step, but it seems that I have a problem with the goroot path when I try to run the builder command: ![image](https://github.com/ollama/ollama/assets/1081108/51091a21-97b0-4959-8b6b-c6d55b80d980) Could it be because of this being installed in a custom folder? EDIT: Meanwhile I tried to point both variables to different paths, but now I have an error on GOPROXY. @qkiel, did you also experienced this? What are your paths for each variable GOROOT, GOPATH and GOPROXY?
Author
Owner

@qkiel commented on GitHub (May 21, 2024):

This warning doesn't matter, just run ollama:

/<path>/./ollama serve

Then in a second terminal window:

/<path>/./ollama --help
<!-- gh-comment-id:2122571227 --> @qkiel commented on GitHub (May 21, 2024): This warning doesn't matter, just run *ollama*: ``` /<path>/./ollama serve ``` Then in a second terminal window: ``` /<path>/./ollama --help ```
Author
Owner

@Jonnybravo commented on GitHub (May 21, 2024):

This warning doesn't matter, just run ollama:

/<path>/./ollama serve

Then in a second terminal window:

/<path>/./ollama --help

I did that and I believe I'm still running it using the CPU. Is there a way to confirm that I'm running this using the GPU?

<!-- gh-comment-id:2122880327 --> @Jonnybravo commented on GitHub (May 21, 2024): > This warning doesn't matter, just run _ollama_: > > ``` > /<path>/./ollama serve > ``` > > Then in a second terminal window: > > ``` > /<path>/./ollama --help > ``` I did that and I believe I'm still running it using the CPU. Is there a way to confirm that I'm running this using the GPU?
Author
Owner

@qkiel commented on GitHub (May 21, 2024):

Look at GPU utilization. I use nvtop for that (also available as a snap):

sudo apt install nvtop
<!-- gh-comment-id:2122911511 --> @qkiel commented on GitHub (May 21, 2024): Look at GPU utilization. I use [nvtop](https://github.com/Syllo/nvtop) for that (also available as a *snap*): ``` sudo apt install nvtop ```
Author
Owner

@Jonnybravo commented on GitHub (May 21, 2024):

image

Installed it, but I think I messed up the ROCm installation. Do I need to do some extra step besides this?:

sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb
sudo apt install ./amdgpu-install_6.1.60101-1_all.deb

If it helps, I'm using Windows with WSL.

<!-- gh-comment-id:2123030785 --> @Jonnybravo commented on GitHub (May 21, 2024): ![image](https://github.com/ollama/ollama/assets/1081108/b5a89615-755a-49b6-8f6f-89f68a926f20) Installed it, but I think I messed up on the ROCm installation. Do I need to do some extra step besides this?: `sudo apt update wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb sudo apt install ./amdgpu-install_6.1.60101-1_all.deb` If it helps, I'm using windows with wsl.
Author
Owner

@qkiel commented on GitHub (May 21, 2024):

Unfortunately, there is no equivalent of the HSA_OVERRIDE_GFX_VERSION environment variable on Windows, so you cannot present your iGPU to ROCm as supported.

Secondly, you install ROCm differently on Windows.

I don't think it can be done on Windows the same way as on Linux.

Edit:

sudo apt update wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb sudo apt install ./amdgpu-install_6.1.60101-1_all.deb

Besides that, you can use this to install ROCm:

sudo amdgpu-install --usecase=rocm --no-dkms

But chances of success are very slim.

<!-- gh-comment-id:2123143575 --> @qkiel commented on GitHub (May 21, 2024): Unfortunately, there is [no equivalent](https://github.com/ROCm/ROCm/issues/2654) of the `HSA_OVERRIDE_GFX_VERSION` environment variable on Windows, so you cannot present your iGPU to ROCm as supported. Secondly, you [install ROCm](https://rocm.docs.amd.com/projects/install-on-windows/en/latest/how-to/install.html) differently on Windows. I don't think it can be done on Windows the same way as on Linux. **Edit:** > `sudo apt update wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb sudo apt install ./amdgpu-install_6.1.60101-1_all.deb` Besides that, you can this to install ROCm: ``` sudo amdgpu-install --usecase=rocm --no-dkms ``` But chances of success are very slim.
Author
Owner

@Jonnybravo commented on GitHub (May 22, 2024):

Yeah, I tried it and got the same problem mentioned on this thread: https://github.com/ROCm/ROCm/issues/3051

And what about a Virtual Machine running linux, @qkiel? Do you think that could work or am I stretching too much here?

<!-- gh-comment-id:2124385639 --> @Jonnybravo commented on GitHub (May 22, 2024): Yeah, I tried it and got the same problem mentioned on this thread: https://github.com/ROCm/ROCm/issues/3051 And what about a Virtual Machine running linux, @qkiel? Do you think that could work or am I stretching too much here?
Author
Owner

@qkiel commented on GitHub (May 22, 2024):

I have no idea. If you have a regular GPU, then you can pass the iGPU to the VM and that could work. I don't think the 5600G supports SR-IOV, so you can't partition the iGPU and pass only part of it.

<!-- gh-comment-id:2125381016 --> @qkiel commented on GitHub (May 22, 2024): I have no idea. If you have a regular GPU, then you can pass iGPU to the VM and that could work. I don't think that 5600G supports SR-IOV so you can't partition iGPU and pass only part of it.
Author
Owner

@xwry commented on GitHub (May 26, 2024):

This warning doesn't matter, just run ollama:

/<path>/./ollama serve

Then in a second terminal window:

/<path>/./ollama --help

I did that and I believe I'm still running it using the CPU. Is there a way to confirm that I'm running this using the GPU?

You can try radeontop; it works fine on AMD iGPUs, and the -c flag adds colorized output.
sudo apt install radeontop

<!-- gh-comment-id:2132127425 --> @xwry commented on GitHub (May 26, 2024): > > This warning doesn't matter, just run _ollama_: > > ``` > > /<path>/./ollama serve > > ``` > > > > > > > > > > > > > > > > > > > > > > > > Then in a second terminal window: > > ``` > > /<path>/./ollama --help > > ``` > > I did that and I believe I'm still running it using the CPU. Is there a way to confirm that I'm running this using the GPU? You can try radeontop, it works fine on iGPU from AMD, -c flag ads colorized output. `sudo apt install radeontop`
Author
Owner

@smellouk commented on GitHub (Jun 19, 2024):

@qkiel thx for this tip 🙏
I followed the steps as you described, but I'm facing this error:

Error: llama runner process has terminated: signal: aborted error:Could not initialize Tensile host: No devices found

My current setup:

  • Proxmox running on host machine
  • LXC ubuntu22.04
  • Shared GPU

What is crazy now is that if I install docker in this LXC and run docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama --device=/dev/kfd --device=/dev/dri/renderD128 --env HSA_OVERRIDE_GFX_VERSION=9.0.0 --env HSA_ENABLE_SDMA=0 ollama/ollama:rocm, everything works great. More details here: https://github.com/ollama/ollama/issues/5143#issuecomment-2179538572

<!-- gh-comment-id:2179568877 --> @smellouk commented on GitHub (Jun 19, 2024): @qkiel thx for this [tip](https://github.com/ollama/ollama/issues/2637#issuecomment-2067766641) 🙏 I followed the steps as you describe but I'm facing this error ``` Error: llama runner process has terminated: signal: aborted error:Could not initialize Tensile host: No devices found ``` My current setup: * Proxmox running on host machine * LXC ubuntu22.04 * Shared GPU What is crazy now, is that if I install docker in this LXC and run `docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama --device=/dev/kfd --device=/dev/dri/renderD128 --env HSA_OVERRIDE_GFX_VERSION=9.0.0 --env HSA_ENABLE_SDMA=0 ollama/ollama:rocm` everything work great. More details here: https://github.com/ollama/ollama/issues/5143#issuecomment-2179538572
Author
Owner

@qkiel commented on GitHub (Jun 20, 2024):

@smellouk I have tutorials on how I install ROCm and Ollama in Incus containers (a fork of LXD):

  • Ai tutorial: ROCm and PyTorch on AMD APU or GPU (https://discuss.linuxcontainers.org/t/ai-tutorial-rocm-and-pytorch-on-amd-apu-or-gpu/19743)
  • Ai tutorial: llama.cpp and Ollama servers + plugins for VS Code / VS Codium and IntelliJ (https://discuss.linuxcontainers.org/t/ai-tutorial-llama-cpp-and-ollama-servers-plugins-for-vs-code-vs-codium-and-intellij/19744)

Do you do this similarly or somehow differently?

<!-- gh-comment-id:2181233992 --> @qkiel commented on GitHub (Jun 20, 2024): @smellouk I have tutorials on how I install ROCm and Ollama in Icnus containers (fork of LXD): - [Ai tutorial: ROCm and PyTorch on AMD APU or GPU](https://discuss.linuxcontainers.org/t/ai-tutorial-rocm-and-pytorch-on-amd-apu-or-gpu/19743) - [Ai tutorial: llama.cpp and Ollama servers + plugins for VS Code / VS Codium and IntelliJ](https://discuss.linuxcontainers.org/t/ai-tutorial-llama-cpp-and-ollama-servers-plugins-for-vs-code-vs-codium-and-intellij/19744) Do you do this similarly or somehow differently?
Author
Owner

@smellouk commented on GitHub (Jun 21, 2024):

@qkiel I used that article and just noticed you are the same author 😆; that Ai tutorial: ROCm and PyTorch on AMD APU or GPU led me here. I followed everything and get the same issue 😢

<!-- gh-comment-id:2183091288 --> @smellouk commented on GitHub (Jun 21, 2024): @qkiel I used that article and I just noticed you are the same owner 😆, that [Ai tutorial: ROCm and PyTorch on AMD APU or GPU](https://discuss.linuxcontainers.org/t/ai-tutorial-rocm-and-pytorch-on-amd-apu-or-gpu/19743) led to me here. I follow everything and same issue 😢
Author
Owner

@qkiel commented on GitHub (Jun 21, 2024):

When you run this command, what do you see?

ls -alF /dev/dri

Do card0 and renderD128 belong to video or render group?

crw-rw----  1 root video 226,   0 cze 21 22:07 card0
crw-rw----  1 root video 226, 128 cze 21 22:07 renderD128

If they belong to root root, that means you didn't set a proper gid when adding the GPU device to the container. For Ubuntu containers, that would be:

incus config device add <container_name> gpu gpu gid=44

Or your user inside the container doesn't belong to video and render groups. For Ubuntu containers, that would be (this requires a restart of the container to take effect):

sudo usermod -a -G render,video ubuntu
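After restarting the container, a quick check that the group change took effect (assuming the default ubuntu user):

```
# should list both "render" and "video" among the groups
id ubuntu
```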
<!-- gh-comment-id:2183395180 --> @qkiel commented on GitHub (Jun 21, 2024): When you run this command, what do you see? ``` ls -alF /dev/dri ``` Do `card0` and `renderD128` belong to `video` or `render` group? ``` crw-rw---- 1 root video 226, 0 cze 21 22:07 card0 crw-rw---- 1 root video 226, 128 cze 21 22:07 renderD128 ``` If they belong to `root root`, that means you didn't set a proper `gid` when adding the GPU device to the container. For Ubuntu containers, that would be: ``` incus config device add <container_name> gpu gpu gid=44 ``` Or your user inside the container doesn't belong to `video` and `render` groups. For Ubuntu containers, that would be (this requires a restart of the container to take effect): ``` sudo usermod -a -G render,video ubuntu ```
Author
Owner

@smellouk commented on GitHub (Jun 22, 2024):

@qkiel permissions are correct as expected

<!-- gh-comment-id:2183974971 --> @smellouk commented on GitHub (Jun 22, 2024): @qkiel permissions are correct as expected
Author
Owner

@arilou commented on GitHub (Jun 25, 2024):

I added the following patch to ollama:

diff --git a/gpu/amd_linux.go b/gpu/amd_linux.go
index 6b08ac2..579186b 100644
--- a/gpu/amd_linux.go
+++ b/gpu/amd_linux.go
@@ -229,6 +229,15 @@ func AMDGetGPUInfo() []GpuInfo {
 		}

 		// iGPU detection, remove this check once we can support an iGPU variant of the rocm library
+		if override, exists := os.LookupEnv("OLLAMA_VRAM_OVERRIDE"); exists {
+			// Convert the environment variable to an integer
+			if value, err := strconv.ParseUint(override, 10, 64); err == nil {
+				totalMemory = value
+			} else {
+				fmt.Println("Error parsing OLLAMA_VRAM_OVERRIDE:", err)
+			}
+		}
+
 		if totalMemory < IGPUMemLimit {
 			slog.Info("unsupported Radeon iGPU detected skipping", "id", gpuID, "total", format.HumanBytes2(totalMemory))
 			continue

@dhiltgen perhaps you want to consider adding this patch to ollama? (I don't have any NVIDIA machine to test with and do the same for CUDA, or whatever Intel has/will have), but I know it works well for AMD.

<!-- gh-comment-id:2187854444 --> @arilou commented on GitHub (Jun 25, 2024): > I added the following patch to ollama: > > ``` > diff --git a/gpu/amd_linux.go b/gpu/amd_linux.go > index 6b08ac2..579186b 100644 > --- a/gpu/amd_linux.go > +++ b/gpu/amd_linux.go > @@ -229,6 +229,15 @@ func AMDGetGPUInfo() []GpuInfo { > } > > // iGPU detection, remove this check once we can support an iGPU variant of the rocm library > + if override, exists := os.LookupEnv("OLLAMA_VRAM_OVERRIDE"); exists { > + // Convert the environment variable to an integer > + if value, err := strconv.ParseUint(override, 10, 64); err == nil { > + totalMemory = value > + } else { > + fmt.Println("Error parsing OLLAMA_VRAM_OVERRIDE:", err) > + } > + } > + > if totalMemory < IGPUMemLimit { > slog.Info("unsupported Radeon iGPU detected skipping", "id", gpuID, "total", format.HumanBytes2(totalMemory)) > continue > ``` @dhiltgen perhaps you want to consider adding this patch to ollama? (I dont have any NVIDIA computer to test and do the same for CUDA, or whatever Intel has/will have) to test this with, but i know it works well for AMD
Author
Owner

@wszgrcy commented on GitHub (Jul 28, 2024):

HSA_OVERRIDE_GFX_VERSION

I can set the environment variable HSA_OVERRIDE_GFX_VERSION and run ollama via msys2, and it correctly recognizes the graphics card.
However, when executing ollama run, it complains that gfx1103 is missing:

image
image

<!-- gh-comment-id:2254525085 --> @wszgrcy commented on GitHub (Jul 28, 2024): > HSA_OVERRIDE_GFX_VERSION I can set environmen `HSA_OVERRIDE_GFX_VERSION` run ollama by `msys2`,And it can also correctly recognize the graphics card However, when executing the 'ollama run', it prompts that gfx1103 is missing ![image](https://github.com/user-attachments/assets/bec4b9ab-728c-4478-a674-b614e3abd624) ![image](https://github.com/user-attachments/assets/f3bb818f-7643-4d3c-80d7-4f6d5c95d8d7)
Author
Owner

@dhiltgen commented on GitHub (Aug 1, 2024):

@wszgrcy you've hit #3107

<!-- gh-comment-id:2264232270 --> @dhiltgen commented on GitHub (Aug 1, 2024): @wszgrcy you've hit #3107
Author
Owner

@MaciejMogilany commented on GitHub (Aug 7, 2024):

On ollama version 0.3.2, with kernel 6.10.3-3 on an Arch-based distro, and with the ollama environment variables below:

Environment="OLLAMA_HOST=0.0.0.0"
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.1"
Environment="HSA_ENABLE_SDMA=0"
Environment="HCC_AMDGPU_TARGET=gfx1101"
Environment="OLLAMA_FLASH_ATTENTION=1"

Probably only Environment="HSA_OVERRIDE_GFX_VERSION=11.0.1" is needed for an RDNA 3 GPU like the 780M.
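For reference, one way to apply that is a systemd drop-in, assuming ollama runs as the usual systemd service (the drop-in file name here is arbitrary):

```
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/igpu.conf <<'EOF'
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.1"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```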

With no other hacks applied (official ollama), on a Ryzen 7 7840HS with the 780M integrated GPU, LLMs load straight into GTT memory, bypassing the allocated VRAM, as shown below:

ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3.1:8b-instruct-q5_K_M 3cfab818fbe8 7.7 GB 100% GPU 4 minutes from now

amdgpu_top
...
Memory Usage

VRAM: [ 1309 / 16384 MiB ] GTT: [ 7215 / 40105 MiB ]
...

The problem now is that the memory available to ollama is calculated from the VRAM allocated in the BIOS. In my case I can use only 16 GB, despite having 42 GB of GTT available, and the 16 GB of unused VRAM is wasted because it is bypassed on kernel 6.10.

Is there any chance to alter the script allocating total memory for APUs on 6.10+ kernels so it is based on GTT, not VRAM?

This would allow using the default VRAM carveout of 0.5 GB, so no memory is wasted, and a full 70B q4 model could be loaded into GPU memory (on a 96 GB RAM system).

At this stage, using the APU GPU is feasible when the model fits in GTT memory, which on a 96 GB RAM system is around 47 GB with a 1 GB VRAM carveout. In my opinion, if the model fits in GTT, use the GPU; if not, use the CPU. I see some instability with big quantized models above 70B that offload partially to the GPU on a system with 96 GB of RAM. This logic would mitigate that, since splitting a model between CPU and GPU requires 2x the RAM (two copies, one for CPU and one for GPU) on consumer APU systems (they do not support XNACK).

<!-- gh-comment-id:2272913656 --> @MaciejMogilany commented on GitHub (Aug 7, 2024): on ollama version 0.3.2 with kernel 6.10.3-3 on Arch based distro and ollama Environment variables as below Environment="OLLAMA_HOST=0.0.0.0" Environment="HSA_OVERRIDE_GFX_VERSION=11.0.1" Environment="HSA_ENABLE_SDMA=0" Environment="HCC_AMDGPU_TARGET=gfx1101" Environment="OLLAMA_FLASH_ATTENTION=1" Probably only needed is Environment="HSA_OVERRIDE_GFX_VERSION=11.0.1" for RDNA 3 GPU like 780M no other hack applied (official ollama) on Ryzen 7 7840HS with 780M integrate GPU, llm-s load straight to GTT memory bypassing allocated VRAM as below ollama ps NAME ID SIZE PROCESSOR UNTIL llama3.1:8b-instruct-q5_K_M 3cfab818fbe8 7.7 GB 100% GPU 4 minutes from now amdgpu_top ... Memory Usage VRAM: [ 1309 / 16384 MiB ] GTT: [ 7215 / 40105 MiB ] ... Problem now is that Memory available for ollama is calculated by allocated VRAM in bios in my case I can use only 16GB despite having available 42GB GTT and wasted 16GB unused VRAM which is bypassed on kernel 6.10. **It's there any chance to alter script allocating total memory for [APU on 6.10+ kernels ](https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs)based on GTT not VRAM?** This will allow using a default carveout of VRAM of 0.5 GB, so no memory is wasted and load full 70B q4 model to GPU memory (on 96GB RAM system) At this stage, using APU GPU is feasible when the model fits in GTT memory, which on 96GB RAM system is around 47GB with 1GB VRAM carveout. In my opinion if model fits in GTT use GPU if not use CPU. I find some instability in big quantized models above 70B that offload partially to GPU on a system with 96GB of RAM. This logic will mitigate this, as splitting model for CPU and GPU require 2x of RAM (two copies one for CPU one for GPU) on consumer APU systems (they do not support xnack)
Author
Owner

@sebastian-philipp commented on GitHub (Aug 7, 2024):

It's there any chance to alter script allocating total memory for APU on 6.10+ kernels based on GTT not VRAM?

Relates to https://github.com/ggerganov/llama.cpp/issues/7145 . I'd search for DLLAMA_HIP_UMA in the comments above as they re-compile llama with the needed flag. And yes, I'm also still on CPU-only mode here and waiting for any progress with Vulkan or this issue.

<!-- gh-comment-id:2273015542 --> @sebastian-philipp commented on GitHub (Aug 7, 2024): > **It's there any chance to alter script allocating total memory for APU on 6.10+ kernels based on GTT not VRAM?** Relates to https://github.com/ggerganov/llama.cpp/issues/7145 . I'd search for `DLLAMA_HIP_UMA` in the comments above as they re-compile llama with the needed flag. And yes, I'm also still on CPU-only mode here and waiting for any progress with Vulkan or this issue.
Author
Owner

@MaciejMogilany commented on GitHub (Aug 7, 2024):

It's there any chance to alter script allocating total memory for APU on 6.10+ kernels based on GTT not VRAM?

Relates to ggerganov/llama.cpp#7145 . I'd search for DLLAMA_HIP_UMA in the comments above as they re-compile llama with the needed flag. And yes, I'm also still on CPU-only mode here and waiting for any progress with Vulkan or this issue.

This is mitigated in 6.10 kernel, that's why I mention it (no hack needed) https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs

<!-- gh-comment-id:2273040492 --> @MaciejMogilany commented on GitHub (Aug 7, 2024): > > **It's there any chance to alter script allocating total memory for APU on 6.10+ kernels based on GTT not VRAM?** > > Relates to [ggerganov/llama.cpp#7145](https://github.com/ggerganov/llama.cpp/issues/7145) . I'd search for `DLLAMA_HIP_UMA` in the comments above as they re-compile llama with the needed flag. And yes, I'm also still on CPU-only mode here and waiting for any progress with Vulkan or this issue. This is mitigated in 6.10 kernel, that's why I mention it (no hack needed) https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs
Author
Owner

@MaciejMogilany commented on GitHub (Aug 7, 2024):

It's there any chance to alter script allocating total memory for APU on 6.10+ kernels based on GTT not VRAM?

Relates to ggerganov/llama.cpp#7145 . I'd search for DLLAMA_HIP_UMA in the comments above as they re-compile llama with the needed flag. And yes, I'm also still on CPU-only mode here and waiting for any progress with Vulkan or this issue.

This is mitigated in 6.10 kernel, that's why I mention it (no hack needed) https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs

I tested ollama with the compile flag DLLAMA_HIP_UMA and it allows all GTT memory to be used on the 6.10 kernel.

<!-- gh-comment-id:2273240383 --> @MaciejMogilany commented on GitHub (Aug 7, 2024): > > > **It's there any chance to alter script allocating total memory for APU on 6.10+ kernels based on GTT not VRAM?** > > > > > > Relates to [ggerganov/llama.cpp#7145](https://github.com/ggerganov/llama.cpp/issues/7145) . I'd search for `DLLAMA_HIP_UMA` in the comments above as they re-compile llama with the needed flag. And yes, I'm also still on CPU-only mode here and waiting for any progress with Vulkan or this issue. > > This is mitigated in 6.10 kernel, that's why I mention it (no hack needed) https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs I test ollama with compile flag DLLAMA_HIP_UMA and it allow all GTT memory to be used on 6.10 kernel
Author
Owner

@sebastian-philipp commented on GitHub (Aug 7, 2024):

I test ollama with compile flag DLLAMA_HIP_UMA and it allow all GTT memory to be used on 6.10 kernel

still pretty unfortunate that llama needs to be recompiled for this.

<!-- gh-comment-id:2273281765 --> @sebastian-philipp commented on GitHub (Aug 7, 2024): > I test ollama with compile flag DLLAMA_HIP_UMA and it allow all GTT memory to be used on 6.10 kernel still pretty unfortunate that llama needs to be recompiled for this.
Author
Owner

@sebastian-philipp commented on GitHub (Aug 7, 2024):

pretty please @Djip007

<!-- gh-comment-id:2273288346 --> @sebastian-philipp commented on GitHub (Aug 7, 2024): pretty please @Djip007
Author
Owner

@MaciejMogilany commented on GitHub (Aug 9, 2024):

It's there any chance to alter script allocating total memory for APU on 6.10+ kernels based on GTT not VRAM?

Relates to ggerganov/llama.cpp#7145 . I'd search for DLLAMA_HIP_UMA in the comments above as they re-compile llama with the needed flag. And yes, I'm also still on CPU-only mode here and waiting for any progress with Vulkan or this issue.

This is mitigated in 6.10 kernel, that's why I mention it (no hack needed) https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs

I test ollama with compile flag DLLAMA_HIP_UMA and it allow all GTT memory to be used on 6.10 kernel

I made a mistake; I don't know why one compilation gave me full GTT memory as GPU memory in ollama. I made so many compilations and did so much tinkering in the code.

I cloned a fresh copy of Ollama and got these results:

ollama v0.1.3.4 (commit de4fc29) and llama.cpp commit 1e6f6544 (Aug 6, 2024), no UMA applied: Ollama sees only 16 GB of GPU memory (what I set in the BIOS) and loads the LLM model into GTT memory on kernel 6.10.

ollama v0.1.3.4 (commit de4fc29) and llama.cpp commit 1e6f6544 (Aug 6, 2024) with the flag -DGGML_HIP_UMA=on:
Ollama still sees only 16 GB of GPU memory, and amdgpu_top doesn't show GTT or VRAM filling up when the LLM model is loaded.

On the 6.10 kernel, -DGGML_HIP_UMA=on is not needed to use shared GTT memory.

Is there any chance to alter the gpu/amd_linux.go script so the total memory for APUs on 6.10+ kernels is calculated from GTT, not VRAM?

<!-- gh-comment-id:2277159105 --> @MaciejMogilany commented on GitHub (Aug 9, 2024): > > > > **It's there any chance to alter script allocating total memory for APU on 6.10+ kernels based on GTT not VRAM?** > > > > > > > > > Relates to [ggerganov/llama.cpp#7145](https://github.com/ggerganov/llama.cpp/issues/7145) . I'd search for `DLLAMA_HIP_UMA` in the comments above as they re-compile llama with the needed flag. And yes, I'm also still on CPU-only mode here and waiting for any progress with Vulkan or this issue. > > > > > > This is mitigated in 6.10 kernel, that's why I mention it (no hack needed) https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs > > I test ollama with compile flag DLLAMA_HIP_UMA and it allow all GTT memory to be used on 6.10 kernel I made a mistake dont't know why one compilation gave me full GTT memory as GPU memory on ollama. I made so many compilations and tinkering in code. I clone a fresh copy of Ollama and got the results: ollama v0.1.3.4 commit de4fc29 and llama.cpp commit 1e6f6544 aug 6 2024 no UMA applied Ollama sees only 16GB GPU memory (I set it in bios). Load LLM model to GTT memory on kernel 6.10 ollama v0.1.3.4 commit de4fc29 and llama.cpp commit 1e6f6544 aug 6 2024 with flag -DGGML_HIP_UMA=on Ollama sees only 16GB GPU memory, amdgpu_top doesn't see GTT or VRAM memory filled when LLM model is loaded. On 6.10 kernel DGGML_HIP_UMA=on is not needed to use shared GTT memory. It's there any chance to alter script gpu/amd_linux.go allocating total memory for [APU on 6.10+ kernels ](https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs)based on GTT not VRAM?
Author
Owner

@MaciejMogilany commented on GitHub (Aug 9, 2024):

It's there any chance to alter script gpu/amd_linux.go allocating total memory for APU on 6.10+ kernels based on GTT not VRAM?

Made a PR that enables full GTT allocation on gfx1103 and gfx1035 APUs on 6.10+ kernels.

<!-- gh-comment-id:2277677927 --> @MaciejMogilany commented on GitHub (Aug 9, 2024): > It's there any chance to alter script gpu/amd_linux.go allocating total memory for [APU on 6.10+ kernels ](https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs)based on GTT not VRAM? Made [PR](https://github.com/ollama/ollama/pull/6282/commits/ed05e507bcd9129bee7e7455c64030343351ef7f) that enable full GTT allocation on gfx1103, gfx1035 APUs on 6.10+ kernels
Author
Owner

@Djip007 commented on GitHub (Aug 14, 2024):

As @MaciejMogilany reports, on Linux:

  • if you use GGML_HIP_UMA=on, llama.cpp uses RAM, not VRAM or GTT.
  • without it, on kernel 6.10+, it uses GTT.
  • without it, on kernel <6.10, llama.cpp uses only VRAM.

And you can adjust GTT with the Linux boot parameter: amdgpu.gttsize=65536 allows use of 64 GB of GTT (by default it is 1/2 of RAM).

(For example @MaciejMogilany can reduce the BIOS VRAM to 2-4 GB and configure amdgpu.gttsize=81920 to have 80 GB of GTT.)
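A sketch of applying that on a GRUB-based distro (other bootloaders and distros differ):

```
# after adding amdgpu.gttsize=81920 to GRUB_CMDLINE_LINUX in /etc/default/grub:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # or "sudo update-grub" on Debian/Ubuntu
sudo reboot
# after the reboot, confirm:
sudo dmesg | grep "of GTT memory ready"
```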

<!-- gh-comment-id:2288238339 --> @Djip007 commented on GitHub (Aug 14, 2024): As @MaciejMogilany report on linux: - if you use `GGML_HIP_UMA=on` llama.cpp use RAM not VRAM or GTT. - without and with kernel 6.10+ it use ~~VRAM+~~ GGT - without an kernel <6.10 llama.cpp use only VRAM. And you can ajust GTT with linux boot : `amdgpu.gttsize=65536` allow use of 64Go of GTT (by default it is 1/2 of RAM) (for exemple @MaciejMogilany can reduce BIOS VRAM to 2/4Go, and configure `amdgpu.gttsize=81920` to have 80Go of GTT.)
Author
Owner

@MaciejMogilany commented on GitHub (Aug 14, 2024):

As @MaciejMogilany report on linux:

  • without and with kernel 6.10+ it use VRAM+GGT

only GTT

And you can ajust GTT with linux boot : amdgpu.gttsize=65536 allow use of 64Go of GTT (by default it is 1/2 of RAM)

(for exemple @MaciejMogilany can reduce BIOS VRAM to 2/4Go, and configure amdgpu.gttsize=81920 to have 80Go of GTT.)

If the model exceeds 50 GB (or maybe is bigger than 1/2 of RAM) ollama becomes unstable. Is increasing the GTT size feasible? Maybe if you need to fit a few small models with small context windows.

<!-- gh-comment-id:2288412284 --> @MaciejMogilany commented on GitHub (Aug 14, 2024): > As @MaciejMogilany report on linux: > > * without and with kernel 6.10+ it use VRAM+GGT only GTT > And you can ajust GTT with linux boot : `amdgpu.gttsize=65536` allow use of 64Go of GTT (by default it is 1/2 of RAM) > > (for exemple @MaciejMogilany can reduce BIOS VRAM to 2/4Go, and configure `amdgpu.gttsize=81920` to have 80Go of GTT.) If the model exceeds 50GB (or maybe is bigger than 1/2 of RAM) ollama becomes unstable. Is increasing GTT size fasiable? Mayby if you need to fit few small models with small context windows.
Author
Owner

@Djip007 commented on GitHub (Aug 14, 2024):

If the model exceeds 50GB (or maybe is bigger than 1/2 of RAM) ollama becomes unstable. Is increasing GTT size fasiable? Mayby if you need to fit few small models with small context windows.

GTT size can be increased with the boot kernel param amdgpu.gttsize=NNN (in MiB).
I always think it is rocblas that is unstable (because it's unsupported?). I have to do more tests with llamafile.

<!-- gh-comment-id:2288646963 --> @Djip007 commented on GitHub (Aug 14, 2024): > > > > If the model exceeds 50GB (or maybe is bigger than 1/2 of RAM) ollama becomes unstable. Is increasing GTT size fasiable? Mayby if you need to fit few small models with small context windows. GGT size can be increasing with boot kernel param : `amdgpu.gttsize=NNN` in MByte. I always thinks it is rocblas that is unstable (because unsuported?) I have to made more test with llamafile.
Author
Owner

@Djip007 commented on GitHub (Aug 14, 2024):

Strange...
I allocated 56 GB of GTT:

$ sudo dmesg | grep "\[drm\] amdgpu:"
[    4.207199] [drm] amdgpu: 2048M of VRAM memory ready
[    4.207202] [drm] amdgpu: 57344M of GTT memory ready.

(Tested with llamafile, not ollama.)
But it doesn't look like I can use more GTT than 1/2 of RAM for HIP allocations... (Linux framework 6.10.3-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 5 14:30:00 UTC 2024 x86_64 GNU/Linux)
And I sometimes get a crash even with a small model (Mistral-7B-Instruct-v0.3.F16.gguf) and tinyblas (i.e. without rocblas).
UMA can allocate more RAM.

Maybe a "bug" in the amdgpu/kfd driver.

(Note: tested with --no-mmap.)

<!-- gh-comment-id:2288909372 --> @Djip007 commented on GitHub (Aug 14, 2024): Strange... I allocate 56Go of GGT: ```sh $ sudo dmesg | grep "\[drm\] amdgpu:" [ 4.207199] [drm] amdgpu: 2048M of VRAM memory ready [ 4.207202] [drm] amdgpu: 57344M of GTT memory ready. ``` (test with llamafile, not ollama) But don't look I can use more GTT than 1/2 of RAM for hip_alloc... (Linux framework 6.10.3-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 5 14:30:00 UTC 2024 x86_64 GNU/Linux) And have sometime crache even with small model (Mistral-7B-Instruct-v0.3.F16.gguf) and tinyblas (ie without rocblas) UMA can alloc more RAM. May be a "bug" in amdgpu/kfd driver. (Note: test with `--no-mmap` )
Author
Owner

@Djip007 commented on GitHub (Aug 20, 2024):

OK, I now have something that "works":

  • OS: Fedora 40, kernel 6.10.5, mesa 24.1.6 (may not be needed, but there were some other bugs in previous releases)
  • Hardware: Ryzen 7940HS / 64 GB RAM
  • TTM config:
# cat /etc/modprobe.d/ttm.conf
# nb of pages 4k, for 48Go
options ttm pages_limit=12582912
options ttm page_pool_size=12582912

now:

$ cat /sys/module/ttm/parameters/pages_limit
12582912
$ cat /sys/module/ttm/parameters/page_pool_size
12582912

sudo dmesg | grep "memory ready"
[    2.995025] [drm] amdgpu: 2048M of VRAM memory ready
[    2.995028] [drm] amdgpu: 49152M of GTT memory ready.
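For reference, pages_limit/page_pool_size are just the target GTT size expressed in 4 KiB pages:

```
echo $((48 * 1024 * 1024 / 4))   # 48 GiB in 4 KiB pages -> 12582912
```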

Install and activate rocblas-gfx1103: module load rocm/gfx1103
I tested with llamafile-0.8.13 (patched) and Meta-Llama-3-70B-Instruct.Q4_K_M.gguf (42520393152 bytes):

llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 39979.48 MiB
llm_load_tensors:  ROCm_Host buffer size =   563.62 MiB
.....
llama_kv_cache_init:      ROCm0 KV buffer size =  2560.00 MiB
llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   276.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     8.00 MiB

So it succeeded in loading the full model onto the GPU via GTT. I was not limited to 1/2 of RAM... it looks like ~42 GB was allocated in GTT.

I have not tried ollama, but this may be good news!

<!-- gh-comment-id:2299814665 --> @Djip007 commented on GitHub (Aug 20, 2024): OK I have somethings now that "work": - OS: Fedora 40, kernel 6.10.5, mesa 24.1.6 (may not be needed, but there is some other bug in prevus release) - Hardware: Ryzen 7940HS / 64Go RAM - TTM config; ``` # cat /etc/modprobe.d/ttm.conf # nb of pages 4k, for 48Go options ttm pages_limit=12582912 options ttm page_pool_size=12582912 ``` now: ``` $ cat /sys/module/ttm/parameters/pages_limit 12582912 $ cat /sys/module/ttm/parameters/page_pool_size 12582912 sudo dmesg | grep "memory ready" [ 2.995025] [drm] amdgpu: 2048M of VRAM memory ready [ 2.995028] [drm] amdgpu: 49152M of GTT memory ready. ``` instal and activate rocblas-gfx1103: `module load rocm/gfx1103` I test with llamafile-0.8.13(pach) and Meta-Llama-3-70B-Instruct.Q4_K_M.gguf (42520393152 size) ``` llm_load_tensors: offloaded 81/81 layers to GPU llm_load_tensors: ROCm0 buffer size = 39979.48 MiB llm_load_tensors: ROCm_Host buffer size = 563.62 MiB ..... llama_kv_cache_init: ROCm0 KV buffer size = 2560.00 MiB llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 0.49 MiB llama_new_context_with_model: ROCm0 compute buffer size = 276.00 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 8.00 MiB ``` so it succeed to load the full model on GPU/GTT. I do not have been limited by 1/2 of RAM... look to have allocate ~42Go on GTT. I do not have ollama, but it may be good news!
Author
Owner

@MaciejMogilany commented on GitHub (Aug 23, 2024):

I test with llamafile-0.8.13(pach) and Meta-Llama-3-70B-Instruct.Q4_K_M.gguf (42520393152 size)

llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 39979.48 MiB
llm_load_tensors:  ROCm_Host buffer size =   563.62 MiB
.....
llama_kv_cache_init:      ROCm0 KV buffer size =  2560.00 MiB
llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   276.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     8.00 MiB

so it succeed to load the full model on GPU/GTT. I do not have been limited by 1/2 of RAM... look to have allocate ~42Go on GTT.

I do not have ollama, but it may be good news!

With

cat /sys/module/ttm/parameters/pages_limit
20971520

Around 80GB of GTT

sudo dmesg | grep "memory ready"
[    4.171529] [drm] amdgpu: 2048M of VRAM memory ready
[    4.171530] [drm] amdgpu: 81920M of GTT memory ready.

uname -r
6.10.6-arch1-1

lscpu | grep -i "Nazwa modelu"
Nazwa modelu:                     AMD Ryzen 7 7840HS w/ Radeon 780M Graphics

the biggest model I was able to load was mistral-large:123b-instruct-2407-q3_K_S

...
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q3_K - Small
llm_load_print_meta: model params     = 122.61 B
llm_load_print_meta: model size       = 49.22 GiB (3.45 BPW) 
llm_load_print_meta: general.name     = Mistral Large Instruct 2407
...
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.87 MiB
llm_load_tensors: offloading 88 repeating layers to GPU
llm_load_tensors: offloaded 88/89 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 49920.75 MiB
llm_load_tensors:  ROCm_Host buffer size =   480.05 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size = 11264.00 MiB
llama_new_context_with_model: KV self size  = 11264.00 MiB, K (f16): 5632.00 MiB, V (f16): 5632.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =  6304.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    88.01 MiB

it takes around 71GB of RAM

cat /sys/class/drm/card1/device/mem_info_gtt_used 
71295127552

but only with llamafile and the --no-mmap flag.
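For reference, the run was along these lines (a sketch: the model file name is made up here, and exact flags depend on the llamafile version):

```
# hypothetical model path; -ngl sets offloaded layers, --no-mmap avoids the extra CPU copy
./llamafile -m Mistral-Large-Instruct-2407-Q3_K_S.gguf -ngl 88 --no-mmap
```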

This same model with mmap almost doubles the memory requirement, as it adds a 50 GB CPU buffer that makes the system page memory indefinitely:

llm_load_tensors:      ROCm0 buffer size = 49920.75 MiB
llm_load_tensors:        CPU buffer size = 50400.80 MiB

ollama doesn't have the ability to pass --no-mmap to llama.cpp. The maximum I was able to load was mistral-large:123b-instruct-2407-q2_K (47 GB of GTT), and it often crashed with bigger context windows. Max memory on a small APU is 96 GB: a 47 GB model in GTT + a CPU buffer of around 40 GB + system + KV buffer = :( There is a feature request to add no-mmap to ollama: https://github.com/ollama/ollama/issues/4895

Another thing is instability. The worst offenders are gemma2 models: gemma2 27B crashes the graphics on load with 50% probability on ollama, but llama3.1 70B q4 runs solid.

<!-- gh-comment-id:2306976825 --> @MaciejMogilany commented on GitHub (Aug 23, 2024): > I test with llamafile-0.8.13(pach) and Meta-Llama-3-70B-Instruct.Q4_K_M.gguf (42520393152 size) > > ``` > llm_load_tensors: offloaded 81/81 layers to GPU > llm_load_tensors: ROCm0 buffer size = 39979.48 MiB > llm_load_tensors: ROCm_Host buffer size = 563.62 MiB > ..... > llama_kv_cache_init: ROCm0 KV buffer size = 2560.00 MiB > llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB > llama_new_context_with_model: ROCm_Host output buffer size = 0.49 MiB > llama_new_context_with_model: ROCm0 compute buffer size = 276.00 MiB > llama_new_context_with_model: ROCm_Host compute buffer size = 8.00 MiB > ``` > > so it succeed to load the full model on GPU/GTT. I do not have been limited by 1/2 of RAM... look to have allocate ~42Go on GTT. > > I do not have ollama, but it may be good news! With ``` cat /sys/module/ttm/parameters/pages_limit 20971520 ``` Around 80GB of GTT ``` sudo dmesg | grep "memory ready" [ 4.171529] [drm] amdgpu: 2048M of VRAM memory ready [ 4.171530] [drm] amdgpu: 81920M of GTT memory ready. uname -r 6.10.6-arch1-1 lscpu | grep -i "Nazwa modelu" Nazwa modelu: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics ``` the biggest model I was able to load was **mistral-large:123b-instruct-2407-q3_K_S** ``` ... llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q3_K - Small llm_load_print_meta: model params = 122.61 B llm_load_print_meta: model size = 49.22 GiB (3.45 BPW) llm_load_print_meta: general.name = Mistral Large Instruct 2407 ... ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no llm_load_tensors: ggml ctx size = 0.87 MiB llm_load_tensors: offloading 88 repeating layers to GPU llm_load_tensors: offloaded 88/89 layers to GPU llm_load_tensors: ROCm0 buffer size = 49920.75 MiB llm_load_tensors: ROCm_Host buffer size = 480.05 MiB .................................................................................................... llama_new_context_with_model: n_ctx = 32768 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llampaginga_new_context_with_model: freq_scale = 1 llama_kv_cache_init: ROCm0 KV buffer size = 11264.00 MiB llama_new_context_with_model: KV self size = 11264.00 MiB, K (f16): 5632.00 MiB, V (f16): 5632.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB llama_new_context_with_model: ROCm0 compute buffer size = 6304.00 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 88.01 MiB ``` it takes around 71GB of RAM ``` cat /sys/class/drm/card1/device/mem_info_gtt_used 71295127552 ``` but only with llamafile and --no-mmap flag This same model with mmap almost doubles the memory requirement as it adds 50GB CPU buffer that make paging memory indefinitely ``` llm_load_tensors: ROCm0 buffer size = 49920.75 MiB llm_load_tensors: CPU buffer size = 50400.80 MiB ``` ollama doesn't have the ability to pas --no-mmap to llama.cpp. The maximum I was able to load was mistral-large:123b-instruct-2407-q2_K (47 GB of GTT) and often crashed with bigger context windows. Max memory on small APU is 96GB. 47GB model in GTT + CPU buffer around 40GB + system +KV buffer = :( Thre is[ future requests](https://github.com/ollama/ollama/issues/4895) fo add no-mmap to ollama Another thing is instability. 
The worst offenders are gemma2 models. gemma2 27B crash graphic on load with 50% probability on ollama but lama3.1 70B q4 run solid.
Author
Owner

@Djip007 commented on GitHub (Aug 24, 2024):

--no-mmap flag

Oh sorry, I forgot to report that... Yes, it is needed (until we manage to get direct access to it; there was a test, so I know it can work).

Max memory on small APU is 96GB.

well... not really:

  • ASRock DeskMeet X600
  • AMD Ryzen 7 8700G
    you have 4 DDR slots, so 192GB is possible 😉

Another thing is instability. The worst offenders are the gemma2 models: gemma2 27B crashes the graphics on load with ~50% probability under ollama, while llama3.1 70B q4 runs solid.

Yes... I don't understand why... I don't know how to debug that, and have no idea what it can be. Maybe I can find some time to create a specific backend for RDNA3 APUs...

<!-- gh-comment-id:2307973098 --> @Djip007 commented on GitHub (Aug 24, 2024): > --no-mmap flag Ho sorry, I forgot to report that... Yes it is needed (until we manage to get direct acces on it (there was a test, so I know it can work...) > Max memory on small APU is 96GB. well... not really: - ASRock DeskMeet X600 - AMD Ryzen 7 8700G you have 4 DDR so 192Go is possible :wink: > Another thing is instability. The worst offenders are gemma2 models. gemma2 27B crash graphic on load with 50% probability on ollama but lama3.1 70B q4 run solid. Yes... I don't understand why... don't know how to debug that, and no idea what it can be. May be I can find some time to create specific backend for RDNA3 APU...
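For anyone reproducing the GTT sizing discussed above, here is a minimal sketch for checking and raising the GTT limit. It assumes a 6.10+ kernel with the amdgpu driver; the 64GB target and the grubby invocation are only illustrative (on Debian/Ubuntu, edit GRUB_CMDLINE_LINUX instead).

# check what the kernel currently allows (pages_limit is in 4 KiB pages)
cat /sys/module/ttm/parameters/pages_limit
sudo dmesg | grep "GTT memory ready"

# example: allow roughly 64GB of GTT at the next boot
# amdgpu.gttsize is in MiB; ttm.pages_limit is in 4 KiB pages (64GB = 16777216 pages)
sudo grubby --update-kernel=ALL --args="amdgpu.gttsize=65536 ttm.pages_limit=16777216"
sudo reboot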
Author
Owner

@rmcmilli commented on GitHub (Aug 25, 2024):

Is there any chance to alter the gpu/amd_linux.go script to allocate total memory for APUs on 6.10+ kernels based on GTT rather than VRAM?

Made a PR that enables full GTT allocation on gfx1103 and gfx1035 APUs on 6.10+ kernels

Added the patch from your PR and it works great so far on my AMD 7840U. Thanks!

<!-- gh-comment-id:2308645078 --> @rmcmilli commented on GitHub (Aug 25, 2024): > > It's there any chance to alter script gpu/amd_linux.go allocating total memory for [APU on 6.10+ kernels ](https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs)based on GTT not VRAM? > > Made [PR](https://github.com/ollama/ollama/pull/6282/commits/ed05e507bcd9129bee7e7455c64030343351ef7f) that enable full GTT allocation on gfx1103, gfx1035 APUs on 6.10+ kernels Added the patch from your PR and it works great so far on my AMD 7840U. Thanks!
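A quick way to see what the patched detection has to work with, versus what stock ollama compares against its iGPU cutoff, is to read the amdgpu sysfs counters directly (a sketch; card1 is just an example, on many systems the iGPU is card0):

# VRAM carve-out set in the BIOS (stock ollama skips iGPUs reporting < 1GB here)
cat /sys/class/drm/card1/device/mem_info_vram_total
# GTT pool the 6.10+ kernel exposes (what the PR allocates against instead)
cat /sys/class/drm/card1/device/mem_info_gtt_total
cat /sys/class/drm/card1/device/mem_info_gtt_used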
Author
Owner

@MaciejMogilany commented on GitHub (Aug 26, 2024):

Made a PR that enables full GTT allocation on gfx1103 and gfx1035 APUs on 6.10+ kernels

Added the patch from your PR and it works great so far on my AMD 7840U. Thanks!

Can you test again with the latest commits? I have added --no-mmap for APUs and subtracted GTT from the RAM size. For the first time, I was able to (partially) load mistral-large:123b-instruct-2407-q4 onto the GPU without a hang or crash of ollama. I am not sure if this is the correct approach, but from initial testing ollama seems more robust.

Default GTT runs more stable for me

<!-- gh-comment-id:2310554293 --> @MaciejMogilany commented on GitHub (Aug 26, 2024): > > Made [PR](https://github.com/ollama/ollama/pull/6282/commits/ed05e507bcd9129bee7e7455c64030343351ef7f) that enable full GTT allocation on gfx1103, gfx1035 APUs on 6.10+ kernels > > Added the patch from your PR and it works great so far on my AMD 7840U. Thanks! Can You test again with last commits? I have added --no-mmap for APU and subtracted GTT from RAM size. For the first time, I was able to load to GPU (partially) mistral-large:123b-instruct-2407-q4 without hang or crash of ollama. I am not sure if this is a correct approach, but from initial testing ollama seems more robust. Default GTT runs more stable for me
Author
Owner

@N4S4 commented on GitHub (Aug 26, 2024):

Hello,
Not sure if this is the right place. I am experiencing an issue where, when setting the override with systemd edit ollama.service, it does not take effect after restart on Ubuntu 24, Ryzen 9 7940HS, Radeon 780M.

output from journalctl -u ollama --no-pager

ollama[49742]: 2024/08/25 08:15:47 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:*

HSA_OVERRIDE_GFX_VERSION: remains empty

could someone help me?

<!-- gh-comment-id:2310785399 --> @N4S4 commented on GitHub (Aug 26, 2024): Hello, not sure if this is the right place, I am experiencing an issue where when setting the override with systemd edit ollama.service it dores not take place after restart on ubuntu 24, ryzen 9 7940hs radeon 780m output from journalctl -u ollama --no-pager ollama[49742]: 2024/08/25 08:15:47 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* HSA_OVERRIDE_GFX_VERSION: remains empty could someone help me?
Author
Owner

@MaciejMogilany commented on GitHub (Aug 26, 2024):

Not sure if this is the right place. I am experiencing an issue where, when setting the override with systemd edit ollama.service, it does not take effect after restart on Ubuntu 24, Ryzen 9 7940HS, Radeon 780M.

output from journalctl -u ollama --no-pager

ollama[49742]: 2024/08/25 08:15:47 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:*

HSA_OVERRIDE_GFX_VERSION: remains empty

could someone help me?

sudo systemctl edit ollama.service

### Anything between here and the comment below will become the contents of the drop-in file

Environment="HSA_OVERRIDE_GFX_VERSION==11.0.1"

sudo systemctl daemon-reload
sudo systemctl restart ollama.service
<!-- gh-comment-id:2310847788 --> @MaciejMogilany commented on GitHub (Aug 26, 2024): > not sure if this is the right place, I am experiencing an issue where when setting the override with systemd edit ollama.service it dores not take place after restart on ubuntu 24, ryzen 9 7940hs radeon 780m > > output from journalctl -u ollama --no-pager > > ollama[49742]: 2024/08/25 08:15:47 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* > > HSA_OVERRIDE_GFX_VERSION: remains empty > > could someone help me? ``` sudo systemctl edit ollama.service ### Anything between here and the comment below will become the contents of the drop-in file Environment="HSA_OVERRIDE_GFX_VERSION==11.0.1" sudo systemctl daemon-reload sudo systemctl restart ollama.service ```
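If the override still seems to be ignored, a minimal check that systemd actually picked up the drop-in (standard systemd commands; the unit name ollama.service is taken from the comments above):

systemctl cat ollama.service                          # the drop-in should be listed under the main unit
systemctl show ollama.service --property=Environment
sudo journalctl -u ollama -b | grep HSA_OVERRIDE_GFX_VERSION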
Author
Owner

@rmcmilli commented on GitHub (Aug 26, 2024):

Made a PR that enables full GTT allocation on gfx1103 and gfx1035 APUs on 6.10+ kernels

Added the patch from your PR and it works great so far on my AMD 7840U. Thanks!

Can you test again with the latest commits? I have added --no-mmap for APUs and subtracted GTT from the RAM size. For the first time, I was able to (partially) load mistral-large:123b-instruct-2407-q4 onto the GPU without a hang or crash of ollama. I am not sure if this is the correct approach, but from initial testing ollama seems more robust.

Default GTT runs more stable for me

I've rebuilt with the new commit, but now I'm having trouble getting HSA_OVERRIDE_GFX_VERSION to take effect. The last image I built looks fine, but the new one doesn't skip the compatibility check even with the environment variable set.

I'm using Docker, by the way, so it might be a local issue.

Edit: Confirmed it was a local issue. Rebuilt and it's running now without issue.

<!-- gh-comment-id:2310921439 --> @rmcmilli commented on GitHub (Aug 26, 2024): > > > Made [PR](https://github.com/ollama/ollama/pull/6282/commits/ed05e507bcd9129bee7e7455c64030343351ef7f) that enable full GTT allocation on gfx1103, gfx1035 APUs on 6.10+ kernels > > > > > > Added the patch from your PR and it works great so far on my AMD 7840U. Thanks! > > Can You test again with last commits? I have added --no-mmap for APU and subtracted GTT from RAM size. For the first time, I was able to load to GPU (partially) mistral-large:123b-instruct-2407-q4 without hang or crash of ollama. I am not sure if this is a correct approach, but from initial testing ollama seems more robust. > > Default GTT runs more stable for me I've rebuilt with the new commit but now i'm having trouble getting `HSA_OVERRIDE_GFX_VERSION` to take. Using the last image I built looks fine but the new one doesn't ignore the compatibility check with the environment variable. I'm using docker btw so it might be a local issue. Edit: Confirmed local issues. Rebuilt and running now without issue.
Author
Owner

@N4S4 commented on GitHub (Aug 26, 2024):

Not sure if this is the right place. I am experiencing an issue where, when setting the override with systemd edit ollama.service, it does not take effect after restart on Ubuntu 24, Ryzen 9 7940HS, Radeon 780M.
output from journalctl -u ollama --no-pager
ollama[49742]: 2024/08/25 08:15:47 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:*
HSA_OVERRIDE_GFX_VERSION: remains empty
could someone help me?

sudo systemctl edit ollama.service

### Anything between here and the comment below will become the contents of the drop-in file

Environment="HSA_OVERRIDE_GFX_VERSION==11.0.1"

sudo systemctl daemon-reload
sudo systemctl restart ollama.service

Strange, I did the same steps before, but now it worked; the variable is set.

Now I get this output from journalctl -u ollama.service | grep gpu:

ago 26 21:32:24 AI-Server ollama[156570]: time=2024-08-26T21:32:24.760+02:00 level=INFO source=gpu.go:350 msg="no compatible GPUs were discovered"
ago 26 21:33:42 AI-Server ollama[156570]: time=2024-08-26T21:33:42.669+02:00 level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama2990594358/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-633fc5be925f9a484b61d6f9b9a78021eeb462100bd557309f01ba84cac26adf --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 1 --threads 14 --no-mmap --parallel 4 --port 39055"
ago 26 21:33:42 AI-Server ollama[156841]: WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="132482596353920" timestamp=1724700822

With OLLAMA_DEBUG set:

INFO source=amd_linux.go:274 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB"

Not sure why it is skipping my iGPU; any idea?

<!-- gh-comment-id:2310944756 --> @N4S4 commented on GitHub (Aug 26, 2024): > > not sure if this is the right place, I am experiencing an issue where when setting the override with systemd edit ollama.service it dores not take place after restart on ubuntu 24, ryzen 9 7940hs radeon 780m > > output from journalctl -u ollama --no-pager > > ollama[49742]: 2024/08/25 08:15:47 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* > > HSA_OVERRIDE_GFX_VERSION: remains empty > > could someone help me? > > ``` > sudo systemctl edit ollama.service > > ### Anything between here and the comment below will become the contents of the drop-in file > > Environment="HSA_OVERRIDE_GFX_VERSION==11.0.1" > > sudo systemctl daemon-reload > sudo systemctl restart ollama.service > ``` strange i did the same steps before but now worked, variable is set now i get from journalctl -u ollama.service | grep gpu output: ``` ago 26 21:32:24 AI-Server ollama[156570]: time=2024-08-26T21:32:24.760+02:00 level=INFO source=gpu.go:350 msg="no compatible GPUs were discovered" ago 26 21:33:42 AI-Server ollama[156570]: time=2024-08-26T21:33:42.669+02:00 level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama2990594358/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-633fc5be925f9a484b61d6f9b9a78021eeb462100bd557309f01ba84cac26adf --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 1 --threads 14 --no-mmap --parallel 4 --port 39055" ago 26 21:33:42 AI-Server ollama[156841]: WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="132482596353920" timestamp=1724700822 ``` set DEBUG `INFO source=amd_linux.go:274 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB" ` not sure why is skipping my igpu any idea?
Author
Owner

@MaciejMogilany commented on GitHub (Aug 26, 2024):

This is not the place for this.

You may go to your BIOS and increase the VRAM size, since ollama officially bypasses integrated graphics: it skips them when VRAM is smaller than 1GB, and AMD sets VRAM to 512MB by default. Another way is to install a 6.9.9+ kernel and compile ollama from this draft: ed05e507bc; then changing VRAM in the BIOS is not needed.


<!-- gh-comment-id:2310986108 --> @MaciejMogilany commented on GitHub (Aug 26, 2024): This is not a place for this. You may go to your bios and increase VRAM size as ollama officially bypasses integrated graphics. It does this by checking if VRAM is smaller than 1GB. And AMD set VRAM to 512MB by default. Another way is to install 6.9.9+ kernel and compile ollama from this draft https://github.com/ollama/ollama/pull/6282/commits/ed05e507bcd9129bee7e7455c64030343351ef7f then changing VRAM in bios is not needed pon., 26 sie 2024, 21:40 użytkownik Renato ***@***.***> napisał: > not sure if this is the right place, I am experiencing an issue where when > setting the override with systemd edit ollama.service it dores not take > place after restart on ubuntu 24, ryzen 9 7940hs radeon 780m > output from journalctl -u ollama --no-pager > ollama[49742]: 2024/08/25 08:15:47 routes.go:1125: INFO server config > env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: > HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false > OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false > OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 > OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models > OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 > OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* > https://localhost:* > HSA_OVERRIDE_GFX_VERSION: remains empty > could someone help me? > > sudo systemctl edit ollama.service > > ### Anything between here and the comment below will become the contents of the drop-in file > > Environment="HSA_OVERRIDE_GFX_VERSION==11.0.1" > > sudo systemctl daemon-reload > sudo systemctl restart ollama.service > > strange i did the same steps before but now worked, variable is set > > now i get from journalctl -u ollama.service | grep gpu output: > > ago 26 21:32:24 AI-Server ollama[156570]: time=2024-08-26T21:32:24.760+02:00 level=INFO source=gpu.go:350 msg="no compatible GPUs were discovered" > ago 26 21:33:42 AI-Server ollama[156570]: time=2024-08-26T21:33:42.669+02:00 level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama2990594358/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-633fc5be925f9a484b61d6f9b9a78021eeb462100bd557309f01ba84cac26adf --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 1 --threads 14 --no-mmap --parallel 4 --port 39055" > ago 26 21:33:42 AI-Server ollama[156841]: WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="132482596353920" timestamp=1724700822 > > — > Reply to this email directly, view it on GitHub > <https://github.com/ollama/ollama/issues/2637#issuecomment-2310944756>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ANORXN2QIIIXN7BCTK3AKSTZTOAEVAVCNFSM6AAAAABDTGRJCKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJQHE2DINZVGY> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> >
Author
Owner

@Xeroxxx commented on GitHub (Sep 15, 2024):

on ollama version 0.3.2 with kernel 6.10.3-3 on Arch based distro and ollama Environment variables as below

Environment="OLLAMA_HOST=0.0.0.0" Environment="HSA_OVERRIDE_GFX_VERSION=11.0.1" Environment="HSA_ENABLE_SDMA=0" Environment="HCC_AMDGPU_TARGET=gfx1101" Environment="OLLAMA_FLASH_ATTENTION=1"

This makes it work for me on my AMD Ryzen 5 Pro 2400GE! Thank you. I just needed to increase VRAM in the BIOS.

Running on a three-node Kubernetes cluster with the AMD Device Plugin.

<!-- gh-comment-id:2351554025 --> @Xeroxxx commented on GitHub (Sep 15, 2024): > on ollama version 0.3.2 with kernel 6.10.3-3 on Arch based distro and ollama Environment variables as below > > Environment="OLLAMA_HOST=0.0.0.0" Environment="HSA_OVERRIDE_GFX_VERSION=11.0.1" Environment="HSA_ENABLE_SDMA=0" Environment="HCC_AMDGPU_TARGET=gfx1101" Environment="OLLAMA_FLASH_ATTENTION=1" This makes it working for me on my AMD Ryzen 5 Pro 2400GE! Thank you. Just need to increase VRAM in BIOS. Running on three Node Kubernetes with AMD Device Plugin.
Author
Owner

@yookoala commented on GitHub (Oct 4, 2024):

Is there a way to force the shared memory allocation so model can be run on GPU instead of CPU?

I'm running a Ryzen 8700g with 96GB RAM. But I cannot seem to run a ~40GB model on GPU.

<!-- gh-comment-id:2393836070 --> @yookoala commented on GitHub (Oct 4, 2024): Is there a way to force the shared memory allocation so model can be run on GPU instead of CPU? I'm running a Ryzen 8700g with 96GB RAM. But I cannot seem to run a ~40GB model on GPU.
Author
Owner

@MaciejMogilany commented on GitHub (Oct 4, 2024):

Compile this branch: https://github.com/ollama/ollama/pull/6282#issue-2457641055; everything is in the PR description.


<!-- gh-comment-id:2393846476 --> @MaciejMogilany commented on GitHub (Oct 4, 2024): compile this Branch https://github.com/ollama/ollama/pull/6282#issue-2457641055 everything is in PR description pt., 4 paź 2024, 16:23 użytkownik Koala Yeung ***@***.***> napisał: > Is there a way to force the shared memory allocation so model can be run > on GPU instead of CPU? > > I'm running a Ryzen 8700g with 96GB RAM. But I cannot seem to run a ~40GB > model on GPU. > > — > Reply to this email directly, view it on GitHub > <https://github.com/ollama/ollama/issues/2637#issuecomment-2393836070>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ANORXN7PH3IIN2362GDC6Y3ZZ2QHVAVCNFSM6AAAAABDTGRJCKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGOJTHAZTMMBXGA> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> >
Author
Owner

@DocMAX commented on GitHub (Oct 5, 2024):

Is GTT possible with kernel 6.8.12? I am on Proxmox, that's why...

<!-- gh-comment-id:2395022329 --> @DocMAX commented on GitHub (Oct 5, 2024): GTT possible with Kernel 6.8.12? I am on Proxmox, thats why...
Author
Owner

@ericcurtin commented on GitHub (Oct 5, 2024):

Is there a way to force the shared memory allocation so model can be run on GPU instead of CPU?

I'm running a Ryzen 8700g with 96GB RAM. But I cannot seem to run a ~40GB model on GPU.

But how much VRAM do you have? GPU workloads use VRAM rather than RAM

<!-- gh-comment-id:2395077362 --> @ericcurtin commented on GitHub (Oct 5, 2024): > Is there a way to force the shared memory allocation so model can be run on GPU instead of CPU? > > I'm running a Ryzen 8700g with 96GB RAM. But I cannot seem to run a ~40GB model on GPU. But how much VRAM do you have? GPU workloads use VRAM rather than RAM
Author
Owner

@yookoala commented on GitHub (Oct 7, 2024):

Is there a way to force the shared memory allocation so model can be run on GPU instead of CPU?
I'm running a Ryzen 8700g with 96GB RAM. But I cannot seem to run a ~40GB model on GPU.

But how much VRAM do you have? GPU workloads use VRAM rather than RAM

GPU in an APU has no dedicated VRAM. It uses shared memory on the motherboard. I want to allocate as much as possible to the GPU for the use here (say 64GB).

<!-- gh-comment-id:2396405103 --> @yookoala commented on GitHub (Oct 7, 2024): > > Is there a way to force the shared memory allocation so model can be run on GPU instead of CPU? > > I'm running a Ryzen 8700g with 96GB RAM. But I cannot seem to run a ~40GB model on GPU. > > But how much VRAM do you have? GPU workloads use VRAM rather than RAM GPU in an APU has no dedicated VRAM. It uses shared memory on the motherboard. I want to allocate as much as possible to the GPU for the use here (say 64GB).
Author
Owner

@robertvazan commented on GitHub (Nov 2, 2024):

Environment="OLLAMA_HOST=0.0.0.0" Environment="HSA_OVERRIDE_GFX_VERSION=11.0.1" Environment="HSA_ENABLE_SDMA=0" Environment="HCC_AMDGPU_TARGET=gfx1101" Environment="OLLAMA_FLASH_ATTENTION=1"

This makes it working for me on my AMD Ryzen 5 Pro 2400GE! Thank you. Just need to increase VRAM in BIOS.

Are you sure this actually uses the BIOS-defined VRAM rather than GTT? Can you check it with radeontop? If I try this with fixed 8GB VRAM allocation in BIOS, Ollama will use iGPU, but the model is loaded in GTT while BIOS-configured VRAM is ignored. HSA_ENABLE_SDMA does nothing for me. If I don't reserve VRAM in BIOS, Ollama does not use the iGPU at all. I have Ryzen 5600G and Fedora.

<!-- gh-comment-id:2453076750 --> @robertvazan commented on GitHub (Nov 2, 2024): > > Environment="OLLAMA_HOST=0.0.0.0" Environment="HSA_OVERRIDE_GFX_VERSION=11.0.1" Environment="HSA_ENABLE_SDMA=0" Environment="HCC_AMDGPU_TARGET=gfx1101" Environment="OLLAMA_FLASH_ATTENTION=1" > > This makes it working for me on my AMD Ryzen 5 Pro 2400GE! Thank you. Just need to increase VRAM in BIOS. Are you sure this actually uses the BIOS-defined VRAM rather than GTT? Can you check it with `radeontop`? If I try this with fixed 8GB VRAM allocation in BIOS, Ollama will use iGPU, but the model is loaded in GTT while BIOS-configured VRAM is ignored. HSA_ENABLE_SDMA does nothing for me. If I don't reserve VRAM in BIOS, Ollama does not use the iGPU at all. I have Ryzen 5600G and Fedora.
Author
Owner

@Sebazzz commented on GitHub (Dec 4, 2024):

When I boot I see this:

[    2.080539] [drm] amdgpu: 512M of VRAM memory ready
[    2.080541] [drm] amdgpu: 31866M of GTT memory ready

Ollama won't use GTT memory? Or are you saying it needs to detect separate VRAM, but does not actually use it?

<!-- gh-comment-id:2518097084 --> @Sebazzz commented on GitHub (Dec 4, 2024): When I boot I see this: ``` [ 2.080539] [drm] amdgpu: 512M of VRAM memory ready [ 2.080541] [drm] amdgpu: 31866M of GTT memory ready ``` Ollama won't use GTT memory? Or are you saying it needs to detect separate VRAM, but does not actually use it?
Author
Owner

@robertvazan commented on GitHub (Dec 5, 2024):

@Sebazzz You have to build #6282 yourself for now to use AMD iGPU.

<!-- gh-comment-id:2519219610 --> @robertvazan commented on GitHub (Dec 5, 2024): @Sebazzz You have to build #6282 yourself for now to use AMD iGPU.
Author
Owner

@Sebazzz commented on GitHub (Dec 7, 2024):

Okay, for anyone else attempting this. I got this working on a Minisforum UM480XT (AMD Ryzen 7 4800H with Radeon Graphics; gfx90c).

image

I followed these steps:

  1. Upgraded from Ubuntu 22.04 LTS to Ubuntu 24.04 LTS with the latest Zabbly kernel. If you install the Zabbly kernel, you may need to disable secure boot.
  2. I upgraded the system from 32GB memory to 64GB which is really surplus given what else I run
  3. Set the graphics memory in the BIOS to at least 1GB. Through ROCm/ROCm/issues/2014 I found out I could do this by entering the BIOS, then going to AMD CBS→NBIO Common Option→GFX Configuration. Then set the GFX mode to "UMD_SPECIFIED", and then set the newly appeared "frame buffer" option to at least 1GB. Because I have surplus memory, I set it to 16GB. Maybe at that point I also didn't need to compile PR #6282, but I did it anyway :-)
  4. Followed the development compile guide.
    • Don't forget the prerequisites including installing ROCm. I was not sure what I needed so I went all-in amdgpu-install --usecase=dkms,multimedia,rocm,opencl,openclsdk,mllib,hip,lrt
    • Ran the build: make -j 5; go build .
  5. Then through #7565 I found out I needed to set the LD_LIBRARY_PATH to take the compiled libggml_rocm otherwise I ran into /tmp/ollama3477896540/runners/rocm/ollama_llama_server: error while loading shared libraries: libggml_rocm.so: cannot open shared object file: No such file or directory
  6. I then used the following environment variables:
OLLAMA_MAX_LOADED_MODELS=1
HSA_ENABLE_SDMA=0
HCC_AMDGPU_TARGET=gfx90c
OLLAMA_NUM_PARALLEL=1
LD_LIBRARY_PATH=<path to compiled repo>/ollama/dist/linux-amd64/lib/ollama/
HSA_OVERRIDE_GFX_VERSION=9.0.0
OLLAMA_FLASH_ATTENTION=1

When setting these vars for testing with export don't forget to quote the versions (e.g.: export HSA_OVERRIDE_GFX_VERSION="9.0.0").

Other things: amdgpu_top is a great utility to watch GPU usage.

Only thing I now need to check is how to install this in a more permanent fashion, specifically which files I need to install where.

<!-- gh-comment-id:2525068281 --> @Sebazzz commented on GitHub (Dec 7, 2024): Okay, for anyone else attempting this. I got this working on a Minisforum UM480XT (AMD Ryzen 7 4800H with Radeon Graphics; gfx90c). ![image](https://github.com/user-attachments/assets/c42e3e5a-bddd-4d00-9989-b35a5ff90663) I followed these steps: 1. Upgraded from Ubuntu 22.04 LTS to Ubuntu 24.04 LTS with latest [Zabby kernel](https://vegastack.com/tutorials/how-to-install-zabbly-kernel-on-ubuntu-22-04/). If you install the Zabby kernel, you may need to disable secure boot. 2. I upgraded the system from 32GB memory to 64GB which is really surplus given what else I run 3. Set the graphics memory in the BIOS to at least 1GB. Through ROCm/ROCm/issues/2014 I found out I could do this by entering the BIOS, then going to AMD CBS→NBIO Common Option→GFX Configuration. Then set the GFX mode to "UMD_SPECIFIED", and then set the newly appeared "frame buffer" option to at least 1GB. Because I have surplus memory, I set it to 16GB. Maybe at the point I also didn't need to compile PR #6282 but I did it anyway :-) 4. Followed the development [compile guide](https://github.com/ollama/ollama/blob/main/docs/development.md). - Don't forget the prerequisites including installing ROCm. I was not sure what I needed so I went all-in `amdgpu-install --usecase=dkms,multimedia,rocm,opencl,openclsdk,mllib,hip,lrt` - Ran the build: `make -j 5; go build .` 6. Then through #7565 I found out I needed to set the `LD_LIBRARY_PATH` to take the compiled `libggml_rocm` otherwise I ran into `/tmp/ollama3477896540/runners/rocm/ollama_llama_server: error while loading shared libraries: libggml_rocm.so: cannot open shared object file: No such file or directory` 7. I used then the following environment variables: ``` OLLAMA_MAX_LOADED_MODELS=1 HSA_ENABLE_SDMA=0 HCC_AMDGPU_TARGET=gfx90c OLLAMA_NUM_PARALLEL=1 LD_LIBRARY_PATH=<path to compiled repo>/ollama/dist/linux-amd64/lib/ollama/ HSA_OVERRIDE_GFX_VERSION=9.0.0 OLLAMA_FLASH_ATTENTION=1 ``` When setting these vars for testing with `export` don't forget to quote the versions (e.g.: `export HSA_OVERRIDE_GFX_VERSION="9.0.0"`). Other things: [`amdgpu_top`](https://cprimozic.net/notes/posts/amdgpu_top-a-modern-radeontop-alternative/) is a great utility to watch GPU sage. Only thing I now need to check is how to install this in a more permanent fashion, specifically which files I need to install where.
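One way to make a self-built binary permanent is a systemd unit carrying the same variables. This is only a sketch, under the assumption that you run the build as a service like the packaged install does; /opt/ollama is a placeholder for wherever your checkout lives.

sudo systemctl edit ollama.service
# put this in the drop-in:
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=9.0.0"
Environment="HSA_ENABLE_SDMA=0"
Environment="HCC_AMDGPU_TARGET=gfx90c"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="LD_LIBRARY_PATH=/opt/ollama/dist/linux-amd64/lib/ollama"
ExecStart=
ExecStart=/opt/ollama/ollama serve
# then reload and restart
sudo systemctl daemon-reload && sudo systemctl restart ollama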
Author
Owner

@sebastian-philipp commented on GitHub (Dec 7, 2024):

@Sebazzz : Can you share some performance benchmarks with and without iGPU? (including some hardware infos)

<!-- gh-comment-id:2525227613 --> @sebastian-philipp commented on GitHub (Dec 7, 2024): @Sebazzz : Can you share some performance benchmarks with and without iGPU? (including some hardware infos)
Author
Owner

@Sebazzz commented on GitHub (Dec 7, 2024):

@Sebazzz : Can you share some performance benchmarks with and without iGPU? (including some hardware infos)

I'm happy to. Is there a defined approach for this?

<!-- gh-comment-id:2525315182 --> @Sebazzz commented on GitHub (Dec 7, 2024): > @Sebazzz : Can you share some performance benchmarks with and without iGPU? (including some hardware infos) I'm happy to. Is there a defined approach for this?
Author
Owner

@Sebazzz commented on GitHub (Dec 7, 2024):

Ollama won't use GTT memory? Or are you saying it needs to detect separate VRAM, but does not actually use it?

That last premise seems to be the case. It needs to detect the 1G VRAM but then dumps everything in GTT. Not sure if that matters for performance, but it is a waste of 1G of system memory.

<!-- gh-comment-id:2525318635 --> @Sebazzz commented on GitHub (Dec 7, 2024): > Ollama won't use GTT memory? Or are you saying it needs to detect separate VRAM, but does not actually use it? That last premise seems to be the case. It needs to detect the 1G VRAM but then dumps everything in GTT. Not sure if that matters for performance, but it is a waste of 1G of system memory.
Author
Owner

@petrm commented on GitHub (Dec 9, 2024):

  1. Set the graphics memory in the BIOS to at least 1GB. Through How to allocate more memory to my Ryzen APU's GPU? ROCm/ROCm#2014 I found out I could do this by entering the BIOS, then going to AMD CBS→NBIO Common Option→GFX Configuration. Then set the GFX mode to "UMD_SPECIFIED", and then set the newly appeared "frame buffer" option to at least 1GB. Because I have surplus memory, I set it to 16GB. Maybe at the point I also didn't need to compile PR AMD integrated graphic on linux kernel 6.9.9+, GTT memory, loading freeze fix #6282 but I did it anyway :-)

I applied the following patch instead, as I have no use for dedicated video RAM:

diff --git a/discover/amd_linux.go b/discover/amd_linux.go
index 3f4d8a47..a8711b75 100644
--- a/discover/amd_linux.go
+++ b/discover/amd_linux.go
@@ -379,17 +379,6 @@ func AMDGetGPUInfo() ([]RocmGPUInfo, error) {
                        index:        gpuID,
                }
 
-               // iGPU detection, remove this check once we can support an iGPU variant of the rocm library
-               if totalMemory < IGPUMemLimit {
-                       reason := "unsupported Radeon iGPU detected skipping"
-                       slog.Info(reason, "id", gpuID, "total", format.HumanBytes2(totalMemory))
-                       unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{
-                               GpuInfo: gpuInfo.GpuInfo,
-                               Reason:  reason,
-                       })
-                       continue
-               }
-
                if int(major) < RocmComputeMin {
                        reason := fmt.Sprintf("amdgpu too old gfx%d%x%x", major, minor, patch)
                        slog.Warn(reason, "gpu", gpuID)
<!-- gh-comment-id:2528333500 --> @petrm commented on GitHub (Dec 9, 2024): > 3. Set the graphics memory in the BIOS to at least 1GB. Through [How to allocate more memory to my Ryzen APU's GPU? ROCm/ROCm#2014](https://github.com/ROCm/ROCm/issues/2014) I found out I could do this by entering the BIOS, then going to AMD CBS→NBIO Common Option→GFX Configuration. Then set the GFX mode to "UMD_SPECIFIED", and then set the newly appeared "frame buffer" option to at least 1GB. Because I have surplus memory, I set it to 16GB. Maybe at the point I also didn't need to compile PR [AMD integrated graphic on linux kernel 6.9.9+, GTT memory, loading freeze fix #6282](https://github.com/ollama/ollama/pull/6282) but I did it anyway :-) I applied following patch instead, as I have no use for dedicated video RAM: ``` diff --git a/discover/amd_linux.go b/discover/amd_linux.go index 3f4d8a47..a8711b75 100644 --- a/discover/amd_linux.go +++ b/discover/amd_linux.go @@ -379,17 +379,6 @@ func AMDGetGPUInfo() ([]RocmGPUInfo, error) { index: gpuID, } - // iGPU detection, remove this check once we can support an iGPU variant of the rocm library - if totalMemory < IGPUMemLimit { - reason := "unsupported Radeon iGPU detected skipping" - slog.Info(reason, "id", gpuID, "total", format.HumanBytes2(totalMemory)) - unsupportedGPUs = append(unsupportedGPUs, UnsupportedGPUInfo{ - GpuInfo: gpuInfo.GpuInfo, - Reason: reason, - }) - continue - } - if int(major) < RocmComputeMin { reason := fmt.Sprintf("amdgpu too old gfx%d%x%x", major, minor, patch) slog.Warn(reason, "gpu", gpuID) ```
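In case it helps anyone else, a minimal sketch of applying that diff to a checkout and rebuilding with the commands quoted earlier in the thread (the patch file name is just an example):

git clone https://github.com/ollama/ollama && cd ollama
# save the diff above as igpu-limit.patch, then:
git apply igpu-limit.patch
make -j 5
go build .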
Author
Owner

@Cirius1792 commented on GitHub (Jan 7, 2025):

@Sebazzz : Can you share some performance benchmarks with and without iGPU? (including some hardware infos)

I'm happy to. Is there a defined approach for this?

I don't know if there is any standard method, but one that worked for me is the following:

for run in {1..10}; do echo "where was beethoven born?" | ollama run tinyllama --verbose 2>&1 >/dev/null | grep "eval rate:"; done

found here

<!-- gh-comment-id:2574545409 --> @Cirius1792 commented on GitHub (Jan 7, 2025): > > @Sebazzz : Can you share some performance benchmarks with and without iGPU? (including some hardware infos) > > I'm happy to. Is there a defined approach for this? I don't know if there is any standard method but one that worked for me id the following: `for run in {1..10}; do echo "where was beethoven born?" | ollama run tinyllama --verbose 2>&1 >/dev/null | grep "eval rate:"; done` found [here](https://github.com/ollama/ollama/blob/ddb6dc81c26721a08453f1db7f2727076e97dabc/docs/tutorials/amd-igpu-780m.md)
Author
Owner

@robertvazan commented on GitHub (Jan 7, 2025):

I don't know if there is any standard method, but one that worked for me is the following:

for run in {1..10}; do echo "where was beethoven born?" | ollama run tinyllama --verbose 2>&1 >/dev/null | grep "eval rate:"; done

found here

Prompt processing is heavily parallelized and its performance also depends on total length of the prompt. I would recommend testing with at least 1K input tokens. Try to ask for a summary of some article.

<!-- gh-comment-id:2574741332 --> @robertvazan commented on GitHub (Jan 7, 2025): > I don't know if there is any standard method but one that worked for me id the following: > > `for run in {1..10}; do echo "where was beethoven born?" | ollama run tinyllama --verbose 2>&1 >/dev/null | grep "eval rate:"; done` > > found [here](https://github.com/ollama/ollama/blob/ddb6dc81c26721a08453f1db7f2727076e97dabc/docs/tutorials/amd-igpu-780m.md) Prompt processing is heavily parallelized and its performance also depends on total length of the prompt. I would recommend testing with at least 1K input tokens. Try to ask for a summary of some article.
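Along those lines, a sketch of a longer-prompt variant of the loop above; article.txt stands for any ~1K-token text you supply, and tinyllama is kept from the earlier example:

for run in {1..10}; do
  (echo "Summarize the following article:"; cat article.txt) \
    | ollama run tinyllama --verbose 2>&1 >/dev/null \
    | grep -E "prompt eval rate:|eval rate:"
done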
Author
Owner

@DocMAX commented on GitHub (Feb 16, 2025):

TL;DR: what's currently the easiest way to run ollama on an AMD APU (5800U)? Thanks.

<!-- gh-comment-id:2661409693 --> @DocMAX commented on GitHub (Feb 16, 2025): TLDR; whats the currently easiest way to run ollama on a AMD APU (5800U)? Thanks.
Author
Owner

@rjmalagon commented on GitHub (Feb 17, 2025):

Shameless plug... I intend to maintain patches that facilitate running Ollama on AMD APUs.
https://github.com/rjmalagon/ollama-linux-amd-apu

<!-- gh-comment-id:2663729627 --> @rjmalagon commented on GitHub (Feb 17, 2025): Shameless plug... I pretend to maintain patches that facilitate the operation of Ollama on an AMD APUs. https://github.com/rjmalagon/ollama-linux-amd-apu
Author
Owner

@Sebazzz commented on GitHub (Feb 17, 2025):

How is that different than the PR?


<!-- gh-comment-id:2663770519 --> @Sebazzz commented on GitHub (Feb 17, 2025): How is that different than the PR? Met vriendelijke groet, Sebastiaan Dammann ________________________________ Van: Ricardo Jesus Malagon Jerez ***@***.***> Verzonden: Monday, February 17, 2025 6:29:12 PM Aan: ollama/ollama ***@***.***> CC: Sebastiaan Dammann ***@***.***>; Mention ***@***.***> Onderwerp: Re: [ollama/ollama] Integrated AMD GPU support (Issue #2637) Shameless plug... I pretend to maintain patches that facilitate the operation of Ollama on an AMD APUs. https://github.com/rjmalagon/ollama-linux-amd-apu — Reply to this email directly, view it on GitHub<https://github.com/ollama/ollama/issues/2637#issuecomment-2663729627>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAK4FMMRBOTV4ZDEDV27Z332QIL6RAVCNFSM6AAAAABDTGRJCKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRTG4ZDSNRSG4>. You are receiving this because you were mentioned.Message ID: ***@***.***> [rjmalagon]rjmalagon left a comment (ollama/ollama#2637)<https://github.com/ollama/ollama/issues/2637#issuecomment-2663729627> Shameless plug... I pretend to maintain patches that facilitate the operation of Ollama on an AMD APUs. https://github.com/rjmalagon/ollama-linux-amd-apu — Reply to this email directly, view it on GitHub<https://github.com/ollama/ollama/issues/2637#issuecomment-2663729627>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAK4FMMRBOTV4ZDEDV27Z332QIL6RAVCNFSM6AAAAABDTGRJCKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRTG4ZDSNRSG4>. You are receiving this because you were mentioned.Message ID: ***@***.***>
Author
Owner

@MaciejMogilany commented on GitHub (Feb 17, 2025):

If it's duct-taped, then make a high-quality patch yourself. And don't use this duct-taped code in your container. Do not insult anyone who tries to push things further.


<!-- gh-comment-id:2664025803 --> @MaciejMogilany commented on GitHub (Feb 17, 2025): If it's duct taped than make high quality path yourself. And don't use this duct taped code on your container. Do not insult anyone who try to push things further. pon., 17 lut 2025, 20:29 użytkownik Ricardo Jesus Malagon Jerez < ***@***.***> napisał: > Shameless plug... I pretend to maintain patches that facilitate the > operation of Ollama on an AMD APUs. > https://github.com/rjmalagon/ollama-linux-amd-apu > > — > Reply to this email directly, view it on GitHub > <https://github.com/ollama/ollama/issues/2637#issuecomment-2663729627>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ANORXN2VRN3DACMAJ2MY4AL2QIL67AVCNFSM6AAAAABDTGRJCKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRTG4ZDSNRSG4> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> > [image: rjmalagon]*rjmalagon* left a comment (ollama/ollama#2637) > <https://github.com/ollama/ollama/issues/2637#issuecomment-2663729627> > > Shameless plug... I pretend to maintain patches that facilitate the > operation of Ollama on an AMD APUs. > https://github.com/rjmalagon/ollama-linux-amd-apu > > — > Reply to this email directly, view it on GitHub > <https://github.com/ollama/ollama/issues/2637#issuecomment-2663729627>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ANORXN2VRN3DACMAJ2MY4AL2QIL67AVCNFSM6AAAAABDTGRJCKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRTG4ZDSNRSG4> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> >
Author
Owner

@lanwin commented on GitHub (Feb 18, 2025):

Is it really worth it? I mean, is the performance of the APU really better than processing it directly on the CPU?

<!-- gh-comment-id:2665059846 --> @lanwin commented on GitHub (Feb 18, 2025): Is it really worth it? Mean is the performance of the APU really better then processing it directly on the CPU?
Author
Owner

@AwesomenessZ commented on GitHub (Feb 18, 2025):

Yes, it's at least double IMO on the 5600G. Plus other things can use the CPU during that time.

<!-- gh-comment-id:2666071243 --> @AwesomenessZ commented on GitHub (Feb 18, 2025): Yes its at least double imo on the 5600g. Plus other things can use the cpu during that time. -------- Original message --------From: Steve Wagner ***@***.***> Date: 2/18/25 1:28 AM (GMT-08:00) To: ollama/ollama ***@***.***> Cc: Awe ***@***.***>, Manual ***@***.***> Subject: Re: [ollama/ollama] Integrated AMD GPU support (Issue #2637) Is it really worth it? Mean is the performance of the APU really better then processing it directly on the CPU?—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: ***@***.***> lanwin left a comment (ollama/ollama#2637) Is it really worth it? Mean is the performance of the APU really better then processing it directly on the CPU? —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>
Author
Owner

@lanwin commented on GitHub (Feb 19, 2025):

Yes its at least double imo on the 5600g. Plus other things can use the cpu during that time.

@AwesomenessZ would you share your setup? I tried to get it running on my 4750G and it felt slower.

<!-- gh-comment-id:2668080939 --> @lanwin commented on GitHub (Feb 19, 2025): > Yes its at least double imo on the 5600g. Plus other things can use the cpu during that time. @AwesomenessZ would you share your setup? I tried to get it running on my 4750G and it felt slower.
Author
Owner

@DocMAX commented on GitHub (Feb 19, 2025):

Is it really worth it? I mean, is the performance of the APU really better than processing it directly on the CPU?

As far as I could run AI tasks on my 5800U APU, I didn't see any big performance impact; it was almost the same as the CPU. Does anyone else have other experiences?

<!-- gh-comment-id:2668150990 --> @DocMAX commented on GitHub (Feb 19, 2025): > Is it really worth it? Mean is the performance of the APU really better then processing it directly on the CPU? As far i could run AI tasks on my 5800U APU i didn't see any big performance impact. Almost the same as CPU. Anyone else has other experiences?
Author
Owner

@MaciejMogilany commented on GitHub (Feb 19, 2025):

Prompt ingestion is much quicker, and the PC runs cooler. Make a prompt that summarizes one page of text to see the difference. Still, the APU is not up to the game. The next gen should be better, as the latest AMD AI APUs have 4-channel memory (more bandwidth).


<!-- gh-comment-id:2668178572 --> @MaciejMogilany commented on GitHub (Feb 19, 2025): prompt ingestion is much quicker, and PC is cooler. make prompt to summarize one page of text to see difference. Still APU is not up to the game. Next gen should be better as latest AMD AI APU has 4 channel memory (more bandwidth) śr., 19 lut 2025, 13:06 użytkownik DocMAX ***@***.***> napisał: > Is it really worth it? Mean is the performance of the APU really better > then processing it directly on the CPU? > > As far i could run AI tasks on my 5800U APU i didn't see any big > performance impact. Almost the same as CPU. Anyone else has other > experiences? > > — > Reply to this email directly, view it on GitHub > <https://github.com/ollama/ollama/issues/2637#issuecomment-2668150990>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ANORXN7CZ7LOO47PR4QW45D2QRJQZAVCNFSM6AAAAABDTGRJCKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRYGE2TAOJZGA> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> > [image: DocMAX]*DocMAX* left a comment (ollama/ollama#2637) > <https://github.com/ollama/ollama/issues/2637#issuecomment-2668150990> > > Is it really worth it? Mean is the performance of the APU really better > then processing it directly on the CPU? > > As far i could run AI tasks on my 5800U APU i didn't see any big > performance impact. Almost the same as CPU. Anyone else has other > experiences? > > — > Reply to this email directly, view it on GitHub > <https://github.com/ollama/ollama/issues/2637#issuecomment-2668150990>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ANORXN7CZ7LOO47PR4QW45D2QRJQZAVCNFSM6AAAAABDTGRJCKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRYGE2TAOJZGA> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> >
Author
Owner

@AwesomenessZ commented on GitHub (Feb 19, 2025):

Yes its at least double imo on the 5600g. Plus other things can use the cpu during that time.

@AwesomenessZ would you share your setup? I tried to get it running on my 4750G and it felt slower.

@lanwin

Docker compose:

ollama:
    image: ollama/ollama:rocm
    restart: unless-stopped
    environment:
      - "HSA_OVERRIDE_GFX_VERSION=9.0.0"
      - "HCC_AMDGPU_TARGETS=gfx900"
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    ports:
      - "11434:11434"
    volumes:
      - /docker/ollama:/root/.ollama

But in the past few weeks I have been having GPU driver crashes, unfortunately.

<!-- gh-comment-id:2669316864 --> @AwesomenessZ commented on GitHub (Feb 19, 2025): > > Yes its at least double imo on the 5600g. Plus other things can use the cpu during that time. > > [@AwesomenessZ](https://github.com/AwesomenessZ) would you share your setup? I tried to get it running on my 4750G and it felt slower. @lanwin Docker compose: ```ymal ollama: image: ollama/ollama:rocm restart: unless-stopped environment: - "HSA_OVERRIDE_GFX_VERSION=9.0.0" - "HCC_AMDGPU_TARGETS=gfx900" devices: - /dev/kfd - /dev/dri group_add: - video ports: - "11434:11434" volumes: - /docker/ollama:/root/.ollama ``` But in the past few weeks I have been having GPU driver crashes, unfortunately.
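To check whether the container really offloads to the iGPU instead of silently falling back to the CPU, something like the following works (a sketch; the service name ollama and the card0 index are assumptions for your setup):

docker compose exec ollama ollama ps      # the PROCESSOR column should report GPU, not 100% CPU
watch -n1 cat /sys/class/drm/card0/device/mem_info_gtt_used    # should climb on the host while a model is loaded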
Author
Owner

@lanwin commented on GitHub (Feb 19, 2025):

Strange. I had that working a few weeks ago. Now I get

"/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory"

<!-- gh-comment-id:2669478398 --> @lanwin commented on GitHub (Feb 19, 2025): Strage. I had that working a few weeks ago. Now I get "/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory"
Author
Owner

@rjmalagon commented on GitHub (Feb 28, 2025):

Strange. I had that working a few weeks ago. Now I get

"/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory"

Not really a problem; it will work without it. But you can map this file in from a local copy; I use a copy from my host.

<!-- gh-comment-id:2691482251 --> @rjmalagon commented on GitHub (Feb 28, 2025): > Strage. I had that working a few weeks ago. Now I get > > "/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory" Not really a problem, will work without it, but you can map this file from a local file, I use a copy from my host.
Author
Owner

@malteneuss commented on GitHub (Mar 19, 2025):

Is it normal that the iGPU is slower than the CPU? I have an AMD Ryzen 7 7735HS with an AMD Radeon 680M (gfx1035), and with e.g. the qwen2.5-coder:3b model I get 14 tokens/s on the iGPU and 15 tokens/s on the CPU. I'm using NixOS with Ollama 0.6.0, following the installation instructions in https://wiki.nixos.org/wiki/Ollama.

<!-- gh-comment-id:2738294291 --> @malteneuss commented on GitHub (Mar 19, 2025): Is it normal that the iGPU is slower than CPU? I have a AMD Ryzen 7 7735HS with AMD Radeon 680M gfx1035 and with e.g. model qwen2.5-coder:3b get 14 token/s on iGPU and 15 tokens/s on CPU. I'm using NixOS with Ollama 0.6.0 following the installation instructions in https://wiki.nixos.org/wiki/Ollama.
Author
Owner

@taweili commented on GitHub (Mar 20, 2025):

Is it normal that the iGPU is slower than CPU? I have a AMD Ryzen 7 7735HS with AMD Radeon 680M gfx1035 and with e.g. model qwen2.5-coder:3b get 14 token/s on iGPU and 15 tokens/s on CPU. I'm using NixOS with Ollama 0.6.0 following the installation instructions in https://wiki.nixos.org/wiki/Ollama.

Yes, for the 7735HS the iGPU gives about the same tokens/s as the CPU. The only advantage is that it frees up the CPU to process other things while the model is running on the GPU.

<!-- gh-comment-id:2738695875 --> @taweili commented on GitHub (Mar 20, 2025): > Is it normal that the iGPU is slower than CPU? I have a AMD Ryzen 7 7735HS with AMD Radeon 680M gfx1035 and with e.g. model qwen2.5-coder:3b get 14 token/s on iGPU and 15 tokens/s on CPU. I'm using NixOS with Ollama 0.6.0 following the installation instructions in https://wiki.nixos.org/wiki/Ollama. Yes, for 7735HS, the iGPU gives about the same performance as the CPU in the tokens/s. The only advantage is that it freed up the CPU to process other things while the model running on GPU.
Author
Owner

@ivanbaldo commented on GitHub (May 21, 2025):

Shouldn't this be closed now? Isn't AMD iGPU already available now?

<!-- gh-comment-id:2898803542 --> @ivanbaldo commented on GitHub (May 21, 2025): Shouldn't this be closed now? Isn't AMD iGPU already available now?
Author
Owner

@DocMAX commented on GitHub (May 21, 2025):

No, iGPU not supported yet

<!-- gh-comment-id:2899034454 --> @DocMAX commented on GitHub (May 21, 2025): No, iGPU not supported yet
Author
Owner

@rjmalagon commented on GitHub (May 21, 2025):

AMD iGPU support is niche. On Linux (with a current kernel) there is GTT memory support for ROCm, which allows the GPU to use almost all of main memory.

But it needs a proper kernel configuration, plus extra memory-management routines in Ollama; without the latter, Ollama uses only the VRAM assigned by the UEFI/firmware.

Just for info, Fedora 42 has everything in place (newer Linux kernel, current ROCm access, etc.), but the more conservative Ubuntu 22.04 LTS is missing GTT support for ROCm.

<!-- gh-comment-id:2899399806 --> @rjmalagon commented on GitHub (May 21, 2025): AMD iGPU is niche, on Linux (current kernel), there is support for GTT memory for ROCM, this allows use almost all main memory for GPU. But it needs proper kernel configuration, and extra memory management routines on Ollama, without the later, Ollama use only the assigned VRAM by the UEFI/Firmware. Just for info, Fedora 42 has all in place (newer linux kernel, current ROCM access, etc), but a conservative Ubuntu 22.04 LTS misses GTT support for ROCM.
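As a rough illustration of the VRAM/GTT split described above, the amdgpu driver exposes both pools via sysfs on reasonably recent kernels. The card index below is an assumption and differs between machines, and the module parameter is only an example of what "proper kernel configuration" can look like:

```bash
# Carved-out VRAM assigned by the UEFI/firmware vs. GTT (system RAM the GPU may borrow).
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_total
# Example only: raising the GTT limit via an amdgpu module parameter (value in MiB),
# e.g. in /etc/modprobe.d/amdgpu.conf, followed by a reboot:
# options amdgpu gttsize=16384
```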
Author
Owner

@Crandel commented on GitHub (Jun 11, 2025):

When will iGPU be supported?

<!-- gh-comment-id:2961630072 --> @Crandel commented on GitHub (Jun 11, 2025): When iGPU will be supported?
Author
Owner

@ddpasa commented on GitHub (Jun 11, 2025):

When iGPU will be supported?

you can use iGPUs via Vulkan. llama.cpp has supported Vulkan for months. You can either use the Vulkan fork of ollama, or better just use llama.cpp directly.

<!-- gh-comment-id:2963252313 --> @ddpasa commented on GitHub (Jun 11, 2025): > When iGPU will be supported? you can use iGPUs via Vulkan. llama.cpp has supported Vulkan for months. You can either use the Vulkan fork of ollama, or better just use llama.cpp directly.
Author
Owner

@ligjn commented on GitHub (Jun 25, 2025):

When iGPU will be supported?

you can use iGPUs via Vulkan. llama.cpp has supported Vulkan for months. You can either use the Vulkan fork of ollama, or better just use llama.cpp directly.

@ddpasa I didn't find an ollama branch named vulkan in the repository. Could you share a link? Thank you.

<!-- gh-comment-id:3004055779 --> @ligjn commented on GitHub (Jun 25, 2025): > > When iGPU will be supported? > > you can use iGPUs via Vulkan. llama.cpp has supported Vulkan for months. You can either use the Vulkan fork of ollama, or better just use llama.cpp directly. @ddpasa I didn't find the ollama branch named vulkan in the repository. Could you provide an address? Thank you.
Author
Owner

@ddpasa commented on GitHub (Jun 25, 2025):

When iGPU will be supported?

you can use iGPUs via Vulkan. llama.cpp has supported Vulkan for months. You can either use the Vulkan fork of ollama, or better just use llama.cpp directly.

@ddpasa I didn't find the ollama branch named vulkan in the repository. Could you provide an address? Thank you.

ollama doesn't support vulkan. the devs don't care.

which is okay, because ollama is just a thin wrapper for llama.cpp server. you can just use llama.cpp directly: https://github.com/ggml-org/llama.cpp

<!-- gh-comment-id:3004079084 --> @ddpasa commented on GitHub (Jun 25, 2025): > > > When iGPU will be supported? > > > > > > you can use iGPUs via Vulkan. llama.cpp has supported Vulkan for months. You can either use the Vulkan fork of ollama, or better just use llama.cpp directly. > > [@ddpasa](https://github.com/ddpasa) I didn't find the ollama branch named vulkan in the repository. Could you provide an address? Thank you. ollama doesn't support vulkan. the devs don't care. which is okay, because ollama is just a thin wrapper for llama.cpp server. you can just use llama.cpp directly: https://github.com/ggml-org/llama.cpp
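For reference, a minimal sketch of the llama.cpp-with-Vulkan route suggested above, assuming the Vulkan SDK and build tools are installed; the model path is a placeholder:

```bash
# Build llama.cpp with the Vulkan backend (GGML_VULKAN is the upstream CMake option).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# Offload as many layers as possible to the iGPU and expose an OpenAI-compatible server.
./build/bin/llama-server -m /path/to/model.gguf -ngl 99 --port 8080
```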
Author
Owner

@Crandel commented on GitHub (Jun 25, 2025):

@ligjn You can try my fork here for AMD iGPU. It works very well. I've created Arch Linux AUR packages too

I'm giving up on maintaining my fork for AMD iGPU. I've switched to llama.cpp + llama-swap and I'm using the Vulkan backend now. It looks like the devs have no interest in anything except Nvidia and corporate support at the moment.

<!-- gh-comment-id:3004089246 --> @Crandel commented on GitHub (Jun 25, 2025): ~@ligjn You can try my fork [here](https://github.com/Crandel/ollama-amd-igpu/tree/amd-igpu) for AMD iGPU. It works very well. I've created Arch Linux [AUR packages](https://aur.archlinux.org/packages?O=0&SeB=nd&K=ollama-amd-igpu&outdated=&SB=p&SO=d&PP=50&submit=Go) too~ I'm [giving up](https://github.com/Crandel/ollama-amd-igpu) with maintaining my fork for AMD iGPU. I've switch to llama.cpp + llama-swap and I'm using Vulkan backend now. Looks like devs have no interest in anything except Nvidia and corporate support at the moment.
Author
Owner

@ericcurtin commented on GitHub (Oct 13, 2025):

We added Vulkan support to docker model runner, so we cover this feature:

https://www.docker.com/blog/docker-model-runner-vulkan-gpu-support/

We've also put effort into putting all our code in one central place to make it easier for people to contribute. Please star, fork, and contribute.

https://github.com/docker/model-runner

We have Vulkan support. You can pull models from Docker Hub, Hugging Face, or any other OCI registry, and you can also push models to Docker Hub or any other OCI registry.

<!-- gh-comment-id:3399440333 --> @ericcurtin commented on GitHub (Oct 13, 2025): We added Vulkan support to docker model runner, so we cover this feature: https://www.docker.com/blog/docker-model-runner-vulkan-gpu-support/ We've also put effort to putting all our code in one central place to make it easier for people to contribute. Please star, fork and contribute. https://github.com/docker/model-runner We have vulkan support. You can pull models from Docker Hub, Huggingface or any other OCI registry. You can also push models to Docker Hub or any other OCI registry.
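As a rough sketch of the workflow described above (the commands require the Docker Model Runner plugin; `ai/smollm2` is only an example model reference from Docker Hub's `ai/` namespace):

```bash
docker model pull ai/smollm2          # pull a model from an OCI registry
docker model run ai/smollm2 "hello"   # run a one-off prompt against it
docker model ls                       # list locally available models
```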
Author
Owner

@Djip007 commented on GitHub (Oct 21, 2025):

If someone wants to make a patch for AMD iGPU, there are some changes in llama.cpp:

https://github.com/ggml-org/llama.cpp/blob/03792ad93609fc67e41041c6347d9aa14e5e0d74/ggml/src/ggml-cuda/ggml-cuda.cu#L110-L135

Now it is simple to use all RAM on an iGPU: you only need to set the env var GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON; there is no need for VRAM/GTT tuning,

so a simple patch can be:

  • if AMD-iGPU

    export GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON
    available GPU memory == available CPU RAM

or something like that. It may even work on Windows.

<!-- gh-comment-id:3429048429 --> @Djip007 commented on GitHub (Oct 21, 2025): if someone want to make a patch for AMD iGPU there is some change on llama.cpp: https://github.com/ggml-org/llama.cpp/blob/03792ad93609fc67e41041c6347d9aa14e5e0d74/ggml/src/ggml-cuda/ggml-cuda.cu#L110-L135 Now it is simple to use all RAM on iGPU you only need to set env var: `GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON` no need for vRAM/GGT, so a simple patch can be: - if AMD-iGPU > export GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON > available GPU memory == available CPU RAM or something like that. It may even work on Windows.
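The same environment variable can be tried by hand today with a ROCm/HIP build of llama.cpp on an iGPU, independent of any Ollama patch. A minimal sketch, where the gfx override value and model path are placeholders that must match the actual hardware:

```bash
# Map the iGPU to a nearby supported ROCm target (value is an example; pick the one for your gfx version).
export HSA_OVERRIDE_GFX_VERSION=11.0.0
# Enables the unified-memory allocation path referenced in the permalink above.
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON
./build/bin/llama-server -m /path/to/model.gguf -ngl 99 --port 8080
```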
Author
Owner

@Djip007 commented on GitHub (Oct 26, 2025):

I just can't figure out where the device RAM is computed now... does it use what ggml reports now?

<!-- gh-comment-id:3447928245 --> @Djip007 commented on GitHub (Oct 26, 2025): just can't figure where the device RAM is compute now... did it use what ggml report now?
Author
Owner

@dhiltgen commented on GitHub (Nov 18, 2025):

In 0.12.11 Vulkan is now included in the official binaries, but still experimental. To enable, set OLLAMA_VULKAN=1 for the server. https://github.com/ollama/ollama/blob/main/docs/faq.mdx#how-do-i-configure-ollama-server

<!-- gh-comment-id:3544531341 --> @dhiltgen commented on GitHub (Nov 18, 2025): In 0.12.11 Vulkan is now included in the official binaries, but still experimental. To enable, set OLLAMA_VULKAN=1 for the server. https://github.com/ollama/ollama/blob/main/docs/faq.mdx#how-do-i-configure-ollama-server
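On a typical Linux systemd install this can be tried roughly as follows (the FAQ linked above covers other platforms; the model name is just an example):

```bash
# Requires Ollama >= 0.12.11.
sudo systemctl edit ollama.service
# In the override, add:
#   [Service]
#   Environment="OLLAMA_VULKAN=1"
sudo systemctl restart ollama
ollama run qwen2.5-coder:3b "hello"   # load any model
ollama ps                             # the PROCESSOR column shows whether it landed on the GPU
```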
Author
Owner

@rmcmilli commented on GitHub (Nov 19, 2025):

In 0.12.11 Vulkan is now included in the official binaries, but still experimental. To enable, set OLLAMA_VULKAN=1 for the server. https://github.com/ollama/ollama/blob/main/docs/faq.mdx#how-do-i-configure-ollama-server

Testing now, but fwiw this has been far more stable than rocm in my limited testing on a 780m.

<!-- gh-comment-id:3554304228 --> @rmcmilli commented on GitHub (Nov 19, 2025): > In 0.12.11 Vulkan is now included in the official binaries, but still experimental. To enable, set OLLAMA_VULKAN=1 for the server. https://github.com/ollama/ollama/blob/main/docs/faq.mdx#how-do-i-configure-ollama-server Testing now, but fwiw this has been far more stable than rocm in my limited testing on a 780m.

Reference: github-starred/ollama#1559