[GH-ISSUE #3840] Vega 56 (gfx900) fails to load model - hipMemGetInfo - error: invalid argument #28138

Closed
opened 2026-04-22 05:58:29 -05:00 by GiteaMirror · 13 comments
Owner

Originally created by @quwassar on GitHub (Apr 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3840

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Hi! I'm having trouble with my AMD Vega 56 video card:

Apr 23 08:24:36 chat-server ollama[95121]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
Apr 23 08:24:36 chat-server ollama[95121]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
Apr 23 08:24:36 chat-server ollama[95121]: ggml_cuda_init: found 1 ROCm devices:
Apr 23 08:24:36 chat-server ollama[95121]:   Device 0: Radeon RX Vega, compute capability 9.0, VMM: no
Apr 23 08:24:36 chat-server ollama[95121]: CUDA error: invalid argument
Apr 23 08:24:36 chat-server ollama[95121]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:2612
Apr 23 08:24:36 chat-server ollama[95121]:   hipMemGetInfo(free, total)
Apr 23 08:24:36 chat-server ollama[95121]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Apr 23 08:24:36 chat-server ollama[95179]: [New LWP 95176]
Apr 23 08:24:37 chat-server ollama[95179]: [Thread debugging using libthread_db enabled]
Apr 23 08:24:37 chat-server ollama[95179]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Apr 23 08:24:37 chat-server ollama[95179]: 0x00007f4968df142f in __GI___wait4 (pid=95179, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Apr 23 08:24:37 chat-server ollama[95121]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Apr 23 08:24:37 chat-server ollama[95179]: #0  0x00007f4968df142f in __GI___wait4 (pid=95179, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Apr 23 08:24:37 chat-server ollama[95179]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Apr 23 08:24:37 chat-server ollama[95179]: #1  0x00000000024e8084 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
Apr 23 08:24:37 chat-server ollama[95179]: #2  0x00000000024e900f in ggml_backend_cuda_get_device_memory ()
Apr 23 08:24:37 chat-server ollama[95179]: #3  0x00000000023dc720 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) ()
Apr 23 08:24:37 chat-server ollama[95179]: #4  0x00000000023bf6e6 in llama_model_load(std::string const&, llama_model&, llama_model_params&) ()
Apr 23 08:24:37 chat-server ollama[95179]: #5  0x00000000023bd47f in llama_load_model_from_file ()
Apr 23 08:24:37 chat-server ollama[95179]: #6  0x0000000002378472 in llama_init_from_gpt_params(gpt_params&) ()
Apr 23 08:24:37 chat-server ollama[95179]: #7  0x00000000022d8754 in llama_server_context::load_model(gpt_params const&) ()
Apr 23 08:24:37 chat-server ollama[95179]: #8  0x00000000022c4381 in main ()
Apr 23 08:24:37 chat-server ollama[95179]: [Inferior 1 (process 95175) detached]
Apr 23 08:24:37 chat-server ollama[95121]: time=2024-04-23T08:24:37.936Z level=ERROR source=routes.go:120 msg="error loading llama server" error="llama runner process no longer running: -1 CUDA error: invalid argument\n  current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:2612\n  hipMemGetInfo(free, total)\nGGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !\"CUDA error\""
Apr 23 08:24:37 chat-server ollama[95121]: time=2024-04-23T08:24:37.936Z level=DEBUG source=server.go:832 msg="stopping llama server"

Can you help me?
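For reference, the call that fails in the log is `hipMemGetInfo(free, total)` inside `ggml_backend_cuda_get_device_memory`. A minimal standalone HIP program such as the sketch below (not part of Ollama; the file name and build line are assumptions) can help check whether the same call already fails outside of Ollama on this card:

```
// repro_memgetinfo.cpp - standalone sketch that exercises the same hipMemGetInfo
// call that fails in ggml_backend_cuda_get_device_memory (names here are mine).
// Build with something like: hipcc repro_memgetinfo.cpp -o repro_memgetinfo
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
    int count = 0;
    hipError_t err = hipGetDeviceCount(&count);
    if (err != hipSuccess) {
        std::fprintf(stderr, "hipGetDeviceCount: %s\n", hipGetErrorString(err));
        return 1;
    }
    std::printf("found %d HIP device(s)\n", count);

    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop{};
        if (hipGetDeviceProperties(&prop, i) == hipSuccess) {
            std::printf("device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
        }
        if ((err = hipSetDevice(i)) != hipSuccess) {
            std::fprintf(stderr, "hipSetDevice(%d): %s\n", i, hipGetErrorString(err));
            continue;
        }
        size_t free_mem = 0, total_mem = 0;
        err = hipMemGetInfo(&free_mem, &total_mem);  // the call from the log above
        if (err != hipSuccess) {
            std::fprintf(stderr, "hipMemGetInfo on device %d: %s\n",
                         i, hipGetErrorString(err));
        } else {
            std::printf("device %d: %zu MiB free / %zu MiB total\n",
                        i, free_mem >> 20, total_mem >> 20);
        }
    }
    return 0;
}
```

If this small program also reports `invalid argument`, the problem likely lies in the ROCm/HIP runtime or driver setup for gfx900 rather than in Ollama itself; if it succeeds, the issue is more likely in how the bundled runner was built.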

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.1.32

GiteaMirror added the amd, bug labels 2026-04-22 05:58:29 -05:00
Author
Owner

@quwassar commented on GitHub (Apr 23, 2024):

Maybe this helps: when I run the service, I get this error:
Apr 23 08:24:35 chat-server ollama[95175]: {"function":"server_params_parse","level":"WARN","line":2494,"msg":"server.cpp is not built with verbose logging.","tid":"139950312619072","timestamp":1713860675}

This issue (https://github.com/ollama/ollama/issues/3425) didn't help me get Ollama running.

Author
Owner

@descention commented on GitHub (May 3, 2024):

I also have a Vega 56. Mine is not throwing an error, just hanging on/after llm_load_tensors.
Linux asuran-mkvi 6.6.29 #1-NixOS SMP PREEMPT_DYNAMIC Sat Apr 27 15:11:44 UTC 2024 x86_64 GNU/Linux
ollama version is 0.1.31

May 03 14:24:38 asuran-mkvi systemd[1]: Started Server for local large language models.
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.754-04:00 level=INFO source=images.go:804 msg="total blobs: 61"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.755-04:00 level=INFO source=images.go:811 msg="total unused blobs removed: 0"
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
May 03 14:24:38 asuran-mkvi ollama[5361]:  - using env:        export GIN_MODE=release
May 03 14:24:38 asuran-mkvi ollama[5361]:  - using code:        gin.SetMode(gin.ReleaseMode)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.PullModelHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.GenerateHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.ChatHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.EmbeddingsHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.CreateModelHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.PushModelHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.CopyModelHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.DeleteModelHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.ShowModelHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.CreateBlobHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.HeadBlobHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.ChatHandler (6 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.ListModelsHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.ListModelsHandler (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: [GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.756-04:00 level=INFO source=routes.go:1118 msg="Listening on [::]:11434 (version 0.1.31)"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.756-04:00 level=INFO source=payload_common.go:113 msg="Extracting dynamic libraries to /tmp/ollama2062404392/runners ..."
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=INFO source=payload_common.go:140 msg="Dynamic LLM libraries [cpu_avx rocm cpu cpu_avx2]"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=DEBUG source=payload_common.go:141 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=INFO source=gpu.go:115 msg="Detecting GPU type"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libcudart.so*"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=DEBUG source=gpu.go:283 msg="gpu management search paths: [/tmp/ollama2062404392/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /run/opengl-driver/lib/libcudart.so** /nix/store/5kmapamwsb2q1nz3d335vpf4fbvxjplr-rocm-smi-6.0.2/lib/libcudart.so**]"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=DEBUG source=gpu.go:283 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /run/opengl-driver/lib/libnvidia-ml.so* /nix/store/5kmapamwsb2q1nz3d335vpf4fbvxjplr-rocm-smi-6.0.2/lib/libnvidia-ml.so*]"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=WARN source=amd_linux.go:53 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=DEBUG source=amd_linux.go:152 msg="discovering VRAM for amdgpu devices"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=DEBUG source=amd_linux.go:171 msg="amdgpu devices [0]"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=INFO source=amd_linux.go:246 msg="[0] amdgpu totalMemory 8176M"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=INFO source=amd_linux.go:247 msg="[0] amdgpu freeMemory  8176M"
May 03 14:24:38 asuran-mkvi ollama[5361]: time=2024-05-03T14:24:38.891-04:00 level=DEBUG source=gpu.go:254 msg="rocm detected 1 devices with 7152M available memory"
May 03 14:25:01 asuran-mkvi ollama[5361]: time=2024-05-03T14:25:01.774-04:00 level=INFO source=routes.go:843 msg="skipping file: registry.ollama.ai/library/mixtral:8x7b-instruct-v0.1-q5_0"
May 03 14:25:01 asuran-mkvi ollama[5361]: [GIN] 2024/05/03 - 14:25:01 | 200 |    3.929711ms |       10.88.0.2 | GET      "/api/tags"
May 03 14:25:01 asuran-mkvi ollama[5361]: [GIN] 2024/05/03 - 14:25:01 | 200 |      17.944µs |       10.88.0.2 | GET      "/api/version"
May 03 14:26:01 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:01.159-04:00 level=INFO source=routes.go:843 msg="skipping file: registry.ollama.ai/library/mixtral:8x7b-instruct-v0.1-q5_0"
May 03 14:26:01 asuran-mkvi ollama[5361]: [GIN] 2024/05/03 - 14:26:01 | 200 |    3.397442ms |       10.88.0.2 | GET      "/api/tags"
May 03 14:26:01 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:01.205-04:00 level=INFO source=routes.go:843 msg="skipping file: registry.ollama.ai/library/mixtral:8x7b-instruct-v0.1-q5_0"
May 03 14:26:01 asuran-mkvi ollama[5361]: [GIN] 2024/05/03 - 14:26:01 | 200 |    1.923652ms |       10.88.0.2 | GET      "/api/tags"
May 03 14:26:01 asuran-mkvi ollama[5361]: [GIN] 2024/05/03 - 14:26:01 | 200 |      11.181µs |       10.88.0.2 | GET      "/api/version"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.160-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.160-04:00 level=WARN source=amd_linux.go:53 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.160-04:00 level=DEBUG source=amd_linux.go:152 msg="discovering VRAM for amdgpu devices"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.160-04:00 level=DEBUG source=amd_linux.go:171 msg="amdgpu devices [0]"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=INFO source=amd_linux.go:246 msg="[0] amdgpu totalMemory 8176M"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=INFO source=amd_linux.go:247 msg="[0] amdgpu freeMemory  8176M"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=DEBUG source=gpu.go:254 msg="rocm detected 1 devices with 7152M available memory"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=WARN source=amd_linux.go:53 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=DEBUG source=amd_linux.go:152 msg="discovering VRAM for amdgpu devices"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=DEBUG source=amd_linux.go:171 msg="amdgpu devices [0]"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=INFO source=amd_linux.go:246 msg="[0] amdgpu totalMemory 8176M"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=INFO source=amd_linux.go:247 msg="[0] amdgpu freeMemory  8176M"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.161-04:00 level=DEBUG source=payload_common.go:94 msg="ordered list of LLM libraries to try [/tmp/ollama2062404392/runners/rocm/libext_server.so /tmp/ollama2062404392/runners/cpu_avx2/libext_server.so]"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.205-04:00 level=INFO source=dyn_ext_server.go:87 msg="Loading Dynamic llm server: /tmp/ollama2062404392/runners/rocm/libext_server.so"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.205-04:00 level=INFO source=dyn_ext_server.go:147 msg="Initializing llama server"
May 03 14:26:08 asuran-mkvi ollama[5361]: time=2024-05-03T14:26:08.205-04:00 level=DEBUG source=dyn_ext_server.go:148 msg="server params: {model:0x7fb5fc7a8640 n_ctx:2048 n_batch:512 n_threads:0 n_parallel:1 rope_freq_base:0 rope_freq_scale:0 memory_f16:true n_gpu_layers:33 main_gpu:0 use_mlock:false use_mmap:true numa:0 embedding:true lora_adapters:<nil> mmproj:<nil> verbose_logging:true _:[0 0 0 0 0 0 0]}"
May 03 14:26:08 asuran-mkvi ollama[5361]: [1714760768] system info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
May 03 14:26:08 asuran-mkvi ollama[5361]: [1714760768] Performing pre-initialization of GPU
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /var/lib/ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   0:                       general.architecture str              = llama
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   2:                          llama.block_count u32              = 32
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - kv  20:               general.quantization_version u32              = 2
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - type  f32:   65 tensors
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - type q4_0:  225 tensors
May 03 14:26:08 asuran-mkvi ollama[5361]: llama_model_loader: - type q6_K:    1 tensors
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_vocab: special tokens definition check successful ( 256/128256 ).
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: format           = GGUF V3 (latest)
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: arch             = llama
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: vocab type       = BPE
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_vocab          = 128256
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_merges         = 280147
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_ctx_train      = 8192
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_embd           = 4096
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_head           = 32
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_head_kv        = 8
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_layer          = 32
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_rot            = 128
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_embd_head_k    = 128
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_embd_head_v    = 128
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_gqa            = 4
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_embd_k_gqa     = 1024
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_embd_v_gqa     = 1024
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: f_norm_eps       = 0.0e+00
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: f_logit_scale    = 0.0e+00
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_ff             = 14336
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_expert         = 0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_expert_used    = 0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: causal attn      = 1
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: pooling type     = 0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: rope type        = 0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: rope scaling     = linear
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: freq_base_train  = 500000.0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: freq_scale_train = 1
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: n_yarn_orig_ctx  = 8192
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: rope_finetuned   = unknown
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: ssm_d_conv       = 0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: ssm_d_inner      = 0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: ssm_d_state      = 0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: ssm_dt_rank      = 0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: model type       = 7B
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: model ftype      = Q4_0
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: model params     = 8.03 B
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
May 03 14:26:08 asuran-mkvi ollama[5361]: llm_load_print_meta: LF token         = 128 'Ä'
May 03 14:26:10 asuran-mkvi ollama[5361]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
May 03 14:26:10 asuran-mkvi ollama[5361]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
May 03 14:26:10 asuran-mkvi ollama[5361]: ggml_cuda_init: found 1 ROCm devices:
May 03 14:26:10 asuran-mkvi ollama[5361]:   Device 0: AMD Radeon RX Vega, compute capability 9.0, VMM: no
May 03 14:26:10 asuran-mkvi ollama[5361]: llm_load_tensors: ggml ctx size =    0.22 MiB
May 03 14:26:11 asuran-mkvi ollama[5361]: llm_load_tensors: offloading 32 repeating layers to GPU
May 03 14:26:11 asuran-mkvi ollama[5361]: llm_load_tensors: offloading non-repeating layers to GPU
May 03 14:26:11 asuran-mkvi ollama[5361]: llm_load_tensors: offloaded 33/33 layers to GPU
May 03 14:26:11 asuran-mkvi ollama[5361]: llm_load_tensors:      ROCm0 buffer size =  4155.99 MiB
May 03 14:26:11 asuran-mkvi ollama[5361]: llm_load_tensors:        CPU buffer size =   281.81 MiB
Author
Owner

@descention commented on GitHub (May 4, 2024):

I pulled llama.cpp and am getting the same result, so I don't think my issue is specific to Ollama.

> nix run github:ggerganov/llama.cpp#rocm -- -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -p "You are a helpful assistant. DO Answer questions. DO Keep your responses short, about two sentences. Where is france?" -ngl 33 -n 128 -c 0

main: build = 0 (unknown)
main: built with HIP version: 6.0.32831- for x86_64-unknown-linux-gnu
main: seed  = 1714787748
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from Meta-Llama-3-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 88
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 6.14 GiB (6.56 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX Vega, compute capability 9.0, VMM: no
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  5871.99 MiB
llm_load_tensors:        CPU buffer size =   410.98 MiB
.
Author
Owner

@descention commented on GitHub (May 4, 2024):

@quwassar, what are your results if you `export HSA_ENABLE_SDMA=0`? See https://github.com/ROCm/ROCm/issues/2781#issuecomment-1974938958
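
For reference, exporting the variable in an interactive shell will not affect a systemd-managed server; here is a minimal sketch of setting it on the service, assuming the official Linux install (unit name ollama.service):

```
# Sketch: set HSA_ENABLE_SDMA=0 for the systemd-managed ollama service
# (assumes the official Linux install script, which creates ollama.service)
sudo systemctl edit ollama.service
# in the override file that opens, add:
#   [Service]
#   Environment="HSA_ENABLE_SDMA=0"
sudo systemctl daemon-reload
sudo systemctl restart ollama
```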


@mishrasidhant commented on GitHub (May 5, 2024):

Running into the same error using:

  • Vega 56
  • Linux
  • AMD CPU
  • v0.1.33

Logs

time=2024-05-05T16:16:31.709-04:00 level=INFO source=images.go:828 msg="total blobs: 5"
time=2024-05-05T16:16:31.709-04:00 level=INFO source=images.go:835 msg="total unused blobs removed: 0"
time=2024-05-05T16:16:31.709-04:00 level=INFO source=routes.go:1071 msg="Listening on 127.0.0.1:11434 (version 0.1.33)"
time=2024-05-05T16:16:31.709-04:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama803774002/runners
time=2024-05-05T16:16:35.257-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]"
time=2024-05-05T16:16:35.257-04:00 level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-05-05T16:16:35.266-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-05T16:16:35.266-04:00 level=WARN source=amd_linux.go:49 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-05-05T16:16:35.266-04:00 level=INFO source=amd_linux.go:217 msg="amdgpu memory" gpu=0 total="8176.0 MiB"
time=2024-05-05T16:16:35.266-04:00 level=INFO source=amd_linux.go:218 msg="amdgpu memory" gpu=0 available="8176.0 MiB"
time=2024-05-05T16:16:35.272-04:00 level=INFO source=amd_linux.go:276 msg="amdgpu is supported" gpu=0 gpu_type=gfx900
[GIN] 2024/05/05 - 16:17:34 | 200 |       44.31µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/05/05 - 16:17:34 | 200 |     1.66282ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/05/05 - 16:17:34 | 200 |     597.683µs |       127.0.0.1 | POST     "/api/show"
time=2024-05-05T16:17:34.671-04:00 level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-05-05T16:17:34.693-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-05T16:17:34.693-04:00 level=WARN source=amd_linux.go:49 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-05-05T16:17:34.694-04:00 level=INFO source=amd_linux.go:217 msg="amdgpu memory" gpu=0 total="8176.0 MiB"
time=2024-05-05T16:17:34.694-04:00 level=INFO source=amd_linux.go:218 msg="amdgpu memory" gpu=0 available="8176.0 MiB"
time=2024-05-05T16:17:34.699-04:00 level=INFO source=amd_linux.go:276 msg="amdgpu is supported" gpu=0 gpu_type=gfx900
time=2024-05-05T16:17:36.853-04:00 level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="8176.0 MiB" memory.required.full="5033.0 MiB" memory.required.partial="5033.0 MiB" memory.required.kv="256.0 MiB" memory.weights.total="4156.0 MiB" memory.weights.repeating="3745.0 MiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-05-05T16:17:36.854-04:00 level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="8176.0 MiB" memory.required.full="5033.0 MiB" memory.required.partial="5033.0 MiB" memory.required.kv="256.0 MiB" memory.weights.total="4156.0 MiB" memory.weights.repeating="3745.0 MiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-05-05T16:17:36.854-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-05T16:17:36.857-04:00 level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama803774002/runners/rocm_v60002/ollama_llama_server --model /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 42245"
time=2024-05-05T16:17:36.858-04:00 level=INFO source=sched.go:340 msg="loaded runners" count=1
time=2024-05-05T16:17:36.858-04:00 level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140024073976896","timestamp":1714940256}
{"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2823,"msg":"build info","tid":"140024073976896","timestamp":1714940256}
{"function":"main","level":"INFO","line":2830,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140024073976896","timestamp":1714940256,"total_threads":16}
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX Vega, compute capability 9.0, VMM: no
CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:2701
  hipMemGetInfo(free, total)
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
[New LWP 402119]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f5a9b5d6ec7 in wait4 () from /usr/lib/libc.so.6
#0  0x00007f5a9b5d6ec7 in wait4 () from /usr/lib/libc.so.6
#1  0x0000000002551264 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#2  0x00000000025521ef in ggml_backend_cuda_get_device_memory ()
#3  0x000000000241c1d0 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) ()
#4  0x00000000023fe6c5 in llama_model_load(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model&, llama_model_params&) ()
#5  0x00000000023fbf21 in llama_load_model_from_file ()
#6  0x00000000023b6c82 in llama_init_from_gpt_params(gpt_params&) ()
#7  0x0000000002312744 in llama_server_context::load_model(gpt_params const&) ()
#8  0x00000000022fe341 in main ()
[Inferior 1 (process 402118) detached]

It's worth noting:

  • The above logs are from installing via the distribution package manager
  • Building v0.1.33 from source leads to the same error (rocm packages installed prior to build)
  • The docker image ollama/ollama:rocm also fails with the same logs as above
  • Installing the distro package (without rocm packages) runs fine on CPU
  • The CPU docker image runs fine

@mishrasidhant commented on GitHub (May 6, 2024):

> @quwassar, what are your results if you `export HSA_ENABLE_SDMA=0`? See https://github.com/ROCm/ROCm/issues/2781#issuecomment-1974938958

No luck, same results as above


@quwassar commented on GitHub (May 6, 2024):

No, this did not help me:
image: https://github.com/ollama/ollama/assets/70705054/54a5cf39-b164-4843-8895-5c8ba7e45e13


@quwassar commented on GitHub (May 6, 2024):

Hm, I upgraded ollama to the latest version and got another error:
image: https://github.com/ollama/ollama/assets/70705054/911d13a3-69d3-4cb6-b487-7f5f3bc9be09


@quwassar commented on GitHub (May 14, 2024):

P.S.
image: https://github.com/ollama/ollama/assets/70705054/a3bbef43-ad53-4a4c-9827-4c3e2338562c


@quwassar commented on GitHub (May 23, 2024):

That's me again.
The problem is fixed with ollama version 0.1.38 plus the HSA_ENABLE_SDMA=0 variable.


@gorgonical commented on GitHub (May 29, 2024):

I have this problem with an RX 6600 on both 0.1.39 and 0.1.38. I've tried HSA_ENABLE_SDMA=0 and it doesn't work. When using PyTorch+rocm6 I get offload, so I know the rocm6 runtime supports my card, at least. I do have to use HSA_OVERRIDE_GFX_VERSION=10.3.0 to get this far.

I have looked through just about everything in the world regarding this, short of actually building the rocm platform myself and trying to substitute locally-built drivers.

I don't know very much about this, but it seems to be from llama.cpp, maybe?

ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6600, compute capability 10.3, VMM: no
CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:2701
  hipMemGetInfo(free, total)
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !"CUDA error"

I'm happy to be instructed about troubleshooting this. Thinking a little about it, since it's a backend function that's failing, maybe the packaged rocm drivers that ollama distributes need to be updated?
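
If the failure shows up through the ollama/ollama:rocm container mentioned earlier in the thread, the same overrides can be passed as environment variables at docker run time. A sketch, assuming the standard ROCm container invocation (the device and volume flags below follow the usual ROCm container setup); HSA_OVERRIDE_GFX_VERSION=10.3.0 is the override @gorgonical describes needing for the RX 6600:

```
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  -e HSA_ENABLE_SDMA=0 \
  -e HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  --name ollama ollama/ollama:rocm
```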


@gorgonical commented on GitHub (May 30, 2024):

Update: I solved this problem for myself.

After building the entire ROCm stack myself, I had the same problem. I dug through the ROCm code and instrumented a few things, and it turned out that I was using a kernel version that was too old. In my case, the minor version of my kernel module was 6, but the hipMemGetInfo call eventually makes an HSA-KMT thunk call that requires kernel module version 9 or later. Updating my kernel version to get the newer version of the amdgpu driver that exposes /dev/kfd solved the problem, with a locally-built ROCm stack.

@quwassar doesn't mention his Linux kernel version, but unless it's reasonably recent (I'm using 6.9.2 now) I expect it's a kernel version problem.
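
The failing call can also be reproduced outside ollama with a few lines of HIP, which makes it easier to test kernel and driver combinations directly. A minimal sketch (hypothetical file name; assumes the ROCm HIP runtime and hipcc are installed; build with hipcc repro_memgetinfo.cpp -o repro):

```cpp
// repro_memgetinfo.cpp -- standalone check of the call that asserts inside ggml-cuda.cu.
// Illustrative sketch only; not part of ollama or llama.cpp.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipError_t err = hipGetDeviceCount(&count);
    if (err != hipSuccess) {
        std::fprintf(stderr, "hipGetDeviceCount failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        if ((err = hipSetDevice(i)) != hipSuccess) {
            std::fprintf(stderr, "hipSetDevice(%d) failed: %s\n", i, hipGetErrorString(err));
            continue;
        }
        size_t free_b = 0, total_b = 0;
        // Same query that ggml_backend_cuda_get_device_memory makes via hipMemGetInfo.
        err = hipMemGetInfo(&free_b, &total_b);
        if (err != hipSuccess) {
            std::fprintf(stderr, "hipMemGetInfo on device %d failed: %s\n", i, hipGetErrorString(err));
        } else {
            std::printf("device %d: %zu MiB free / %zu MiB total\n", i, free_b >> 20, total_b >> 20);
        }
    }
    return 0;
}
```

If this standalone program reports the same "invalid argument" error, the problem sits in the ROCm runtime/kernel pairing rather than in ollama or llama.cpp.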


@dhiltgen commented on GitHub (Jun 21, 2024):

I believe the Vega 56 should be working properly now with 0.1.45 (https://github.com/ollama/ollama/pull/4875). Please upgrade, and if you're still having problems, please share an updated server log and I'll reopen the issue.
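
On a standard Linux install, re-running the official install script upgrades in place (a sketch; package-manager and Docker installs upgrade through their own channels):

```
curl -fsSL https://ollama.com/install.sh | sh
```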

Reference: github-starred/ollama#28138