[GH-ISSUE #7564] Ollama fails to run with ROCm 6.2.2 in Arch packaging #30577

Closed
opened 2026-04-22 10:20:37 -05:00 by GiteaMirror · 43 comments
Owner

Originally created by @kode54 on GitHub (Nov 7, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7564

What is the issue?

I figure this is a downstream packaging issue, but it could possibly use some upstream help. Arch is at 0.3.12 and has recently attempted to package against the ROCm 6.2.2 packages in its testing repositories, which I am working on signing off on as a tester.

The models llama3.2, llama3.1, llama3, and llama2 all fail to load and run on my RX 7700 XT. At least llama3.1 is verified to work with this repository's official binaries package, as installed with the installer script to /usr/local.

The llama3-series models fail with:

```
Error: llama runner process has terminated: CUDA error
```

The llama2 model fails with a more verbose error:

```
Error: llama runner process has terminated: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1861
  hipblasGemmStridedBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const char *) src0_f16, HIPBLAS_R_16F, nb01/nb00, nb02/nb00, (const char *) src1_f16, HIPBLAS_R_16F, nb11/nb10, nb12/nb10, beta, ( char *) dst_t, cu_data_type, ne01, nb2/nb0, ne12*ne13, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
/build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error
```
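For context, the failing `hipblasGemmStridedBatchedEx` call performs a batched half-precision GEMM with `transA = T`, i.e. `C[i] = alpha * A[i]^T @ B[i] + beta * C[i]` for each batch index `i`. A rough CPU sketch of those semantics in NumPy (illustrative only; function and variable names here are hypothetical, not Ollama code):

```python
import numpy as np

def gemm_strided_batched(A, B, alpha=1.0, beta=0.0, C=None):
    """Illustrative CPU model of a strided-batched GEMM with transA='T':
    C[i] = alpha * A[i].T @ B[i] + beta * C[i] for each batch index i."""
    batch, K, M = A.shape
    N = B.shape[2]
    out = np.zeros((batch, M, N), dtype=np.float32) if C is None else C.astype(np.float32).copy()
    for i in range(batch):
        # Accumulate in fp32, as a GPU BLAS typically would for f16 inputs
        out[i] = alpha * (A[i].astype(np.float32).T @ B[i].astype(np.float32)) + beta * out[i]
    return out

rng = np.random.default_rng(0)
# fp16 inputs, matching the f16_r types in the rocBLAS traces below
A = rng.standard_normal((4, 8, 3)).astype(np.float16)  # batch of K x M
B = rng.standard_normal((4, 8, 5)).astype(np.float16)  # batch of K x N
C = gemm_strided_batched(A, B)
# Cross-check the loop against einsum: sum_k A[i,k,m] * B[i,k,n]
ref = np.einsum('ikm,ikn->imn', A.astype(np.float32), B.astype(np.float32))
assert np.allclose(C, ref, atol=1e-3)
```

The `ne*`/`nb*` arguments in the log are ggml's tensor extents and byte strides, which llama.cpp translates into the row/column/batch strides that appear in the rocBLAS error dump.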

I will be attempting to upgrade the Arch package to build ollama 0.4.0, and test further. I'm also reporting this to Arch packaging for ollama-rocm, since again, it's likely a downstream packaging issue, but I'm not sure exactly what issue would cause it, be it the particular version of ROCm, or something else with the packaging.

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.3.12

GiteaMirror added the build, bug labels 2026-04-22 10:20:37 -05:00

@ydalton commented on GitHub (Nov 8, 2024):

You might want to attach the logs from the ollama server, as I'm having the exact same issue in an Arch Linux distrobox. When I run the llama3 model, I get your exact message, and this is what I get in the server's logs:

```
rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 32, N: 2, K: 128, alpha: 1, row_stride_a: 1, col_stride_a: 1024, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 32, row_stride_d: 1, col_stride_d: 32, beta: 0, batch_count: 32, strided_batch: false, stride_a: 32768, stride_b: 8192, stride_c: 64, stride_d: 64, atomics_mode: atomics_allowed }
Alpha value 7.21875 doesn't match that set in problem: 1
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
CUDA error: CUBLAS_STATUS_INTERNAL_ERROR
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1890
  hipblasGemmBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
/build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error
time=2024-11-08T12:07:01.858+01:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
time=2024-11-08T12:07:03.911+01:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR\n  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1890\n  hipblasGemmBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)\n/build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error"
```

Edit: I have tested the Docker container (obtained with docker pull ollama/ollama:rocm) and I did not get any of these errors whatsoever.
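For anyone wanting to try the same workaround, a run invocation along the lines of Ollama's documented Docker instructions (container/volume names and the model here are just examples) looks roughly like:

```shell
docker pull ollama/ollama:rocm
# Pass through the AMD KFD and DRI devices so the container sees the GPU,
# persist models in a named volume, and expose the API port:
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm
# Then run a model inside the container:
docker exec -it ollama ollama run llama3
```

Since the container bundles its own ROCm userspace libraries, a working result here points the finger at the distro's ROCm packages rather than the kernel driver.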


@kode54 commented on GitHub (Nov 8, 2024):

I also tested the install.sh version from this repository and it works. I also reported this to downstream Arch packaging; they're having a bit of a problem getting 0.4.0 to build.


@dhiltgen commented on GitHub (Nov 8, 2024):

PR #7499 will hopefully help with the downstream packaging.

I haven't seen that ROCm failure mode before, so I'm not sure exactly what went wrong. It might help to build on another distro (Ubuntu, etc.) with official ROCm packages from AMD, capture the log, and then compare the flags being passed in the Arch build to see if anything obvious jumps out.


@lubosz commented on GitHub (Nov 9, 2024):

@dhiltgen Unfortunately your make_targets branch does not have any effect on this issue.

The ROCm CUBLAS_STATUS_INTERNAL_ERROR remains the same. I also get a long Go stack trace, though it's most likely irrelevant.

```
rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 32, N: 27, K: 128, alpha: 1, row_stride_a: 1, col_stride_a: 1024, row_stride_b: 1, col_stride_b: 3072, row_stride_c: 1, col_stride_c: 32, row_stride_d: 1, col_stride_d: 32, beta: 0, batch_count: 24, strided_batch: false, stride_a: 32768, stride_b: 82944, stride_c: 864, stride_d: 864, atomics_mode: atomics_allowed }
Alpha value 7.21875 doesn't match that set in problem: 1
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
CUDA error: CUBLAS_STATUS_INTERNAL_ERROR
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at ggml-cuda.cu:1934
  hipblasGemmBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
ggml-cuda.cu:132: CUDA error
ptrace: Operation not permitted.
No stack.
The program is not being run.
SIGABRT: abort
```

Same behaviour on latest main and on make_targets, built with make -j12 and go build . So the problem is independent of the Arch Linux ollama-rocm package, but might be related to how Arch packages ROCm 6.2.

The exact same issue also occurs in llama.cpp:
https://github.com/ggerganov/llama.cpp/issues/10234

Speaking of make targets, the AMDGPU_TARGETS variable from the docs doesn't actually seem to be used in the build. See https://github.com/ollama/ollama/blob/main/docs/development.md#linux-rocm-amd

When setting AMDGPU_TARGETS=gfx1030, it still builds for all architectures; a git grep AMDGPU_TARGETS also confirms the variable is not used in the ollama build. I needed this patch to build only for my card's architecture:

```diff
 llama/make/Makefile.rocm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/llama/make/Makefile.rocm b/llama/make/Makefile.rocm
index 97a1fff7..331ef7d1 100644
--- a/llama/make/Makefile.rocm
+++ b/llama/make/Makefile.rocm
@@ -5,8 +5,8 @@
 
 include make/common-defs.make
 
-HIP_ARCHS_COMMON := gfx900 gfx940 gfx941 gfx942 gfx1010 gfx1012 gfx1030 gfx1100 gfx1101 gfx1102
-HIP_ARCHS_LINUX := gfx906:xnack- gfx908:xnack- gfx90a:xnack+ gfx90a:xnack-
+HIP_ARCHS_COMMON := gfx1030
+HIP_ARCHS_LINUX := 
 
 ifeq ($(OS),windows)
 	GPU_LIB_DIR_WIN := $(shell cygpath -m -s "$(HIP_PATH)/bin")
```
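One way to verify a patch like this took effect is to inspect which gfx ISAs actually ended up in the built library. The library path below is hypothetical (it varies by build layout), and `roc-obj-ls` availability varies by ROCm version, so treat this as a sketch:

```shell
# List the code objects bundled in the fat binary (ships with ROCm's hipcc tooling):
roc-obj-ls ./libggml_rocm.so
# Cruder fallback that needs no ROCm tools: grep for embedded gfx target names.
strings ./libggml_rocm.so | grep -o 'gfx[0-9a-f]*' | sort -u
```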

@nonetrix commented on GitHub (Nov 10, 2024):

I get this from ollama serve:

```
rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 32, N: 2, K: 256, alpha: 1, row_stride_a: 1, col_stride_a: 2048, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 32, row_stride_d: 1, col_stride_d: 32, beta: 0, batch_count: 16, strided_batch: false, stride_a: 65536, stride_b: 8192, stride_c: 64, stride_d: 64, atomics_mode: atomics_allowed }
Alpha value 7.21875 doesn't match that set in problem: 1
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
CUDA error: CUBLAS_STATUS_INTERNAL_ERROR
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1890
  hipblasGemmBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
/build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error
ptrace: Operation not permitted.
No stack.
The program is not being run.
time=2024-11-10T02:22:20.061-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
time=2024-11-10T02:22:22.567-06:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR\n  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1890\n  hipblasGemmBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)\n/build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error"
[GIN] 2024/11/10 - 02:22:22 | 500 |  7.073751523s |       127.0.0.1 | POST     "/api/generate"
```

@lubosz commented on GitHub (Nov 10, 2024):

Running ollama serve with AMD_LOG_LEVEL=3 results in only one hipErrorNotReady error:
[ollama_serve_amd_log_level_3_llama3.log](https://github.com/user-attachments/files/17691064/ollama_serve_amd_log_level_3_llama3.log)

While llama.cpp gives me many hipErrorNotFound errors when run with a similar model (llama 3), resulting in the same CUBLAS_STATUS_INTERNAL_ERROR:

:1:hip_code_object.cpp      :1006: 145439538784 us: [pid:41455 tid:0x76a7c04a03c0] Cannot find the function: Cijk_Alik_Bljk_HSS_BH_MT64x32x8_SE_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPPA0_LBSPPB0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_PKA0_SIA1_SLW1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT4_2_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW2_VSn1_VW2_VWB2_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :84  : 145439538800 us: [pid:41455 tid:0x76a7c04a03c0] Cannot find the function: Cijk_Alik_Bljk_HSS_BH_MT64x32x8_SE_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPPA0_LBSPPB0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_PKA0_SIA1_SLW1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT4_2_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW2_VSn1_VW2_VWB2_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x444c8f30
:3:hip_module.cpp           :85  : 145439538811 us: [pid:41455 tid:0x76a7c04a03c0] hipModuleGetFunction: Returned hipErrorNotFound : 
:3:hip_error.cpp            :36  : 145439538815 us: [pid:41455 tid:0x76a7c04a03c0]  hipGetLastError (  ) 
:3:hip_module.cpp           :74  : 145439538819 us: [pid:41455 tid:0x76a7c04a03c0]  hipModuleGetFunction ( 0x7ffd364e0b60, 0x5d95441624f0, Cijk_Alik_Bljk_HSS_BH_MT64x32x8_SE_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS2_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPPA0_LBSPPB0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_PKA0_SIA1_SLW1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT4_2_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW2_VSn1_VW2_VWB2_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 ) 

@lubosz commented on GitHub (Nov 10, 2024):

This seems to be the same error on Gentoo:
https://github.com/ollama/ollama/issues/6857


@Stefanomarton commented on GitHub (Nov 12, 2024):

Same problem here: the latest update of ollama from the Arch repo is broken, apparently due to rocm-core. I tried downgrading to ROCm 6.0.2 with no success.

Installing ollama using the script in the docs solved the issue for me.


@unclemusclez commented on GitHub (Nov 13, 2024):

Ubuntu 24.04, gfx906, origin main.

I get:

```
/tmp/ollama719590581/runners/rocm/ollama_llama_server: error while loading shared libraries: libggml_rocm.so: cannot open shared object file: No such
```

These files exist and were compiled. However, I moved the file from the source tree to /opt/ollama/lib/libggml_rocm.so and it still wasn't found even with LD_LIBRARY_PATH set. I'm not sure which environment variable it is looking for, but I don't compile this on the machine locally, and, as mentioned above, it's building every target, not just the one AMD GPU target.
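One way to narrow down a "cannot open shared object file" error is to ask the dynamic loader directly which paths it tries. The runner path below is taken from the error above but is per-run (the /tmp directory name changes each launch), so adjust it:

```shell
# Show which libraries the runner resolves and which come up missing:
ldd /tmp/ollama719590581/runners/rocm/ollama_llama_server | grep -E 'ggml|not found'
# Trace the loader's search to see every directory it actually tries
# (LD_DEBUG=libs is a standard glibc loader facility):
LD_DEBUG=libs /tmp/ollama719590581/runners/rocm/ollama_llama_server 2>&1 | grep libggml_rocm
```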


@zw963 commented on GitHub (Nov 13, 2024):

I copied my original issue to here for reference.

What is the issue?

I use Arch Linux, and I updated my packages to the latest today.

The following are the upgraded versions:

```
[2024-11-11T17:25:48+0800] [ALPM] upgraded rocm-opencl-sdk (6.0.2-1 -> 6.2.2-1)
[2024-11-11T17:25:48+0800] [ALPM] upgraded python-pytorch-rocm (2.3.1-8 -> 2.5.1-3)
[2024-11-11T17:25:45+0800] [ALPM] upgraded rocm-hip-sdk (6.0.2-1 -> 6.2.2-1)
[2024-11-11T17:25:45+0800] [ALPM] upgraded rocm-hip-libraries (6.0.2-1 -> 6.2.2-1)
[2024-11-11T17:25:44+0800] [ALPM] upgraded rocm-smi-lib (6.0.2-1 -> 6.2.2-1)
[2024-11-11T17:25:44+0800] [ALPM] upgraded rocm-hip-runtime (6.0.2-1 -> 6.2.2-1)
[2024-11-11T17:25:44+0800] [ALPM] upgraded rocm-cmake (6.0.2-1 -> 6.2.2-1)
[2024-11-11T17:25:44+0800] [ALPM] upgraded rocm-language-runtime (6.0.2-1 -> 6.2.2-1)
[2024-11-11T17:25:41+0800] [ALPM] upgraded rocm-clang-ocl (6.0.2-1 -> 6.1.2-1)
[2024-11-11T17:25:41+0800] [ALPM] upgraded rocm-opencl-runtime (6.0.2-1 -> 6.2.2-1)
[2024-11-11T17:25:06+0800] [ALPM] upgraded rocminfo (6.0.2-1 -> 6.2.2-1)
[2024-11-11T17:25:05+0800] [ALPM] upgraded rocm-device-libs (6.0.2-1 -> 6.2.2-2)
[2024-11-11T17:25:05+0800] [ALPM] upgraded rocm-llvm (6.0.2-1 -> 6.2.2-2)
[2024-11-11T17:24:59+0800] [ALPM] upgraded rocm-core (6.0.2-2 -> 6.2.2-1)
```

Before this update, I ran ollama serve with the following command:

```
HSA_OVERRIDE_GFX_VERSION=11.0.0 OLLAMA_KEEP_ALIVE=-1 ollama serve
```

I could then run ollama run llama3.2 successfully.

After this update, I can still start the server without issue. See the following startup log:

ollama serve startup log:
```
 ╰──➤ $ HSA_OVERRIDE_GFX_VERSION=11.0.0 OLLAMA_KEEP_ALIVE=-1 ollama serve
2024/11/11 22:46:36 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION:11.0.0 HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/zw963/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-11-11T22:46:36.620+08:00 level=INFO source=images.go:755 msg="total blobs: 15"
time=2024-11-11T22:46:36.620+08:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
time=2024-11-11T22:46:36.620+08:00 level=INFO source=routes.go:1240 msg="Listening on 127.0.0.1:11434 (version 0.4.1)"
time=2024-11-11T22:46:36.621+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama3370373680/runners
time=2024-11-11T22:46:36.670+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v11 cuda_v12 rocm cpu cpu_avx cpu_avx2]"
time=2024-11-11T22:46:36.670+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-11T22:46:36.695+08:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-11-11T22:46:36.695+08:00 level=INFO source=amd_linux.go:386 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION=11.0.0
time=2024-11-11T22:46:36.695+08:00 level=INFO source=types.go:123 msg="inference compute" id=0 library=rocm variant="" compute=gfx1103 driver=0.0 name=1002:15bf total="8.0 GiB" available="5.9 GiB"
```

it fails with the following log when running ollama run llama3.2:

2024/11/11 - 22:50:32 | 200 |      17.033µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/11/11 - 22:50:32 | 200 |    13.61116ms |       127.0.0.1 | POST     "/api/show"
time=2024-11-11T22:50:32.351+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/home/zw963/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=0 parallel=4 available=6257360896 required="3.7 GiB"
time=2024-11-11T22:50:32.352+08:00 level=INFO source=server.go:105 msg="system memory" total="54.7 GiB" free="45.2 GiB" free_swap="63.0 GiB"
time=2024-11-11T22:50:32.352+08:00 level=INFO source=memory.go:343 msg="offload to rocm" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[5.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2024-11-11T22:50:32.354+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3370373680/runners/rocm/ollama_llama_server --model /home/zw963/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 8 --parallel 4 --port 28751"
time=2024-11-11T22:50:32.354+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-11-11T22:50:32.354+08:00 level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
time=2024-11-11T22:50:32.354+08:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
/tmp/ollama3370373680/runners/rocm/ollama_llama_server: error while loading shared libraries: libggml_rocm.so: cannot open shared object file: No such file or directory
time=2024-11-11T22:50:32.605+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: exit status 127"
[GIN] 2024/11/11 - 22:50:32 | 500 |  288.728002ms |       127.0.0.1 | POST     "/api/generate"

I can still use it if I don't set HSA_OVERRIDE_GFX_VERSION=11.0.0 when starting the ollama server.

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.4.1


EDIT:

I tried installing the Arch Linux official ollama-rocm package and running it as the root user:

Name            : ollama-rocm
Version         : 0.3.12-6
Description     : Create, run and share large language models (LLMs) with ROCm
Architecture    : x86_64
URL             : https://github.com/ollama/ollama
Licenses        : MIT
Groups          : None
Provides        : ollama
Depends On      : hipblas
Optional Deps   : rocm-smi-lib: monitor GPU usage with rocm-smi [installed]
Required By     : None
Optional For    : None
Conflicts With  : ollama
Replaces        : None
Installed Size  : 62.95 MiB
Packager        : Alexander F. Rødseth <xyproto@archlinux.org>
Build Date      : Sun 10 Nov 2024 01:52:37 AM CST
Install Date    : Wed 13 Nov 2024 12:41:00 PM CST
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature

But no luck; it fails with the following error message:

 ╰──➤ $ ollama run llama3.2
Error: llama runner process has terminated: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1890
  hipblasGemmBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
/build/ollama-rocm/src/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error
@dhiltgen commented on GitHub (Nov 13, 2024):

@zw963 I'd suggest running with OLLAMA_DEBUG=1 set while you're working through the packaging misalignments. The 0.4.0 release changed a number of things on how we build and the packaging layout. The debug logs should help spot where things are out of sync. I'd also suggest trying my branch from #7499

@unclemusclez commented on GitHub (Nov 13, 2024):

> @zw963 I'd suggest running with OLLAMA_DEBUG=1 set while you're working through the packaging misalignments. The 0.4.0 release changed a number of things on how we build and the packaging layout. The debug logs should help spot where things are out of sync. I'd also suggest trying my branch from #7499

@dhiltgen I tried your PR with: AMDGPU_TARGETS="gfx906" make -j16

It still seems to build every GPU target anyway, and I get the same error: /tmp/ollama3533626889/runners/rocm/ollama_llama_server: error while loading shared libraries: libggml_rocm.so: cannot open shared object file: No such file or directory

43 msg="offload to rocm" layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[32.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="27.9 GiB" memory.required.partial="27.9 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[27.9 GiB]" memory.weights.total="25.8 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
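One way to see which shared libraries the extracted runner actually resolves is to point ldd at it. A minimal sketch, assuming the default temp layout; the /tmp directory name is randomized per server start, so the glob is an assumption:

```shell
# Locate the extracted ROCm runner and inspect its shared-library resolution.
# "not found" entries indicate which library the dynamic loader cannot locate.
runner=$(ls /tmp/ollama*/runners/rocm/ollama_llama_server 2>/dev/null | head -n1)
if [ -n "$runner" ]; then
  ldd "$runner" | grep -E 'ggml|hipblas|rocblas|not found' || true
else
  echo "runner not extracted; start 'ollama serve' and load a model first"
fi
```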

@svenstaro commented on GitHub (Nov 14, 2024):

Note to Arch people on ROCm: I'm pretty sure you're actually seeing https://gitlab.archlinux.org/archlinux/packaging/packages/rocblas/-/issues/2. ollama upstream, sorry for the noise! We're trying to sort this out. It's all a bit confusing, though it appears to be a ROCm bug. A user in the linked issue is currently digging into it. It is apparently already fixed upstream, so maybe we can backport the fix.

@Chris2000SP commented on GitHub (Nov 14, 2024):

I downgraded successfully by going through the whole dependency path of ollama-rocm on Arch. What I didn't downgrade was mesa, because I want to play games and so on. Hope that helps. EDIT: I forgot to say that I didn't downgrade ollama-rocm itself.

@zw963 commented on GitHub (Nov 14, 2024):

> I downgraded succesfully by going the whole dependency path through of ollama-rocm on Arch. What i didn't downgrade was mesa because i want to play games aso. Hope that helps. EDIT: i forgot to say that ollama-rocm it self i didn't downgrade.

I downgraded all packages to their 2024/11/04 versions yesterday, but no luck.

@dhiltgen commented on GitHub (Nov 14, 2024):

@unclemusclez I've further refined how the -L paths are being passed to the build. Please give the latest commits on the branch a try. If it still fails to find things, can you share the build output at the end so I can see the go build that didn't work correctly?

@Chris2000SP commented on GitHub (Nov 14, 2024):

> > I downgraded succesfully by going the whole dependency path through of ollama-rocm on Arch. What i didn't downgrade was mesa because i want to play games aso. Hope that helps. EDIT: i forgot to say that ollama-rocm it self i didn't downgrade.
> >
> > I downgrade all packages yesterday, to 2024/11/04, but no luck.

I downgraded ROCm to version 6.0.2 (the 2024/10/20 packages) for all dependencies.

@Tealk commented on GitHub (Nov 14, 2024):

I also noticed the problem today and reinstalled extra/ollama-rocm 0.4.1-1. I had previously installed extra/ollama 0.4.1-1 and that did not have the problem.

Nov 14 19:03:38 FrameWork ollama[2486]: [GIN] 2024/11/14 - 19:03:38 | 200 |      28.743µs |       127.0.0.1 | HEAD     "/"
Nov 14 19:03:38 FrameWork ollama[2486]: [GIN] 2024/11/14 - 19:03:38 | 200 |   10.776396ms |       127.0.0.1 | POST     "/api/show"
Nov 14 19:03:38 FrameWork ollama[2486]: time=2024-11-14T19:03:38.332+01:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/var/lib/ollama/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 gpu=0 parallel=4 available=7591833600 required="5.6 GiB"
Nov 14 19:03:38 FrameWork ollama[2486]: time=2024-11-14T19:03:38.333+01:00 level=INFO source=server.go:105 msg="system memory" total="30.7 GiB" free="21.8 GiB" free_swap="4.0 GiB"
Nov 14 19:03:38 FrameWork ollama[2486]: time=2024-11-14T19:03:38.333+01:00 level=INFO source=memory.go:343 msg="offload to rocm" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[7.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.6 GiB" memory.required.partial="5.6 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[5.6 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
Nov 14 19:03:38 FrameWork ollama[2486]: time=2024-11-14T19:03:38.334+01:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3267574408/runners/rocm/ollama_llama_server --model /var/lib/ollama/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 8 --parallel 4 --port 37767"
Nov 14 19:03:38 FrameWork ollama[2486]: time=2024-11-14T19:03:38.334+01:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Nov 14 19:03:38 FrameWork ollama[2486]: time=2024-11-14T19:03:38.334+01:00 level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
Nov 14 19:03:38 FrameWork ollama[2486]: time=2024-11-14T19:03:38.334+01:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
Nov 14 19:03:38 FrameWork ollama[2486]: /tmp/ollama3267574408/runners/rocm/ollama_llama_server: error while loading shared libraries: libggml_rocm.so: cannot open shared object file: No such file or directory
Nov 14 19:03:38 FrameWork ollama[2486]: time=2024-11-14T19:03:38.585+01:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: exit status 127"
Nov 14 19:03:38 FrameWork ollama[2486]: [GIN] 2024/11/14 - 19:03:38 | 500 |  284.840795ms |       127.0.0.1 | POST     "/api/generate"
@Chris2000SP commented on GitHub (Nov 14, 2024):

@Tealk did you test the output of your LLM?

@Tealk commented on GitHub (Nov 14, 2024):

What exactly do you mean by that?

ollama run qwen2.5:latest
Error: llama runner process has terminated: exit status 127
@Chris2000SP commented on GitHub (Nov 14, 2024):

Sorry, my bad. I misunderstood your sentence.

For me, it works after downgrading the dependencies, except for mesa and ollama-rocm itself.

@Tealk commented on GitHub (Nov 14, 2024):

On which version?

@Chris2000SP commented on GitHub (Nov 14, 2024):

> > > I downgraded succesfully by going the whole dependency path through of ollama-rocm on Arch. What i didn't downgrade was mesa because i want to play games aso. Hope that helps. EDIT: i forgot to say that ollama-rocm it self i didn't downgrade.
> >
> > I downgrade all packages yesterday, to 2024/11/04, but no luck.
>
> I have rocm downgraded to the version 6.0.2 or the date 2024/10/20 for all dependencies.

This here ^^^^^^^

@Tealk commented on GitHub (Nov 14, 2024):

Unfortunately I can't find that on my computer, all versions start with 0.

[screenshot: available package versions]

@Chris2000SP commented on GitHub (Nov 14, 2024):

> Unfortunately I can't find that on my computer, all versions start with 0.
>
> [screenshot: available package versions]

Arch has an archive for that.

@Chris2000SP commented on GitHub (Nov 14, 2024):

@Tealk I mean all the dependencies of ollama-rocm, not the package itself: hipblas and so on.

@Tealk commented on GitHub (Nov 14, 2024):

You mean the dependencies; OK, then I'll wait for a patch.

@Chris2000SP commented on GitHub (Nov 14, 2024):

rocm-core-6.0.2-2, hipblas-6.0.2-1, hsa-rocr-6.0.2-2, rccl-6.0.2-1, rocalution-6.0.2-2, rocblas-6.0.2-1, rocfft-6.0.2-1, hip-runtime-amd-6.0.2-4, comgr-6.0.2-1, rocm-device-libs-6.0.2-1, rocm-llvm-6.0.2-1, rocprofiler-6.0.2-2, rocsolver-6.0.2-3, rocsparse-6.0.2-2

Here is the list of packages I downgraded.
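For reference, downgrading a whole package set on Arch is usually done by pointing pacman at an Arch Linux Archive snapshot rather than downgrading each package by hand. A minimal sketch, assuming 2024/11/04 (a date mentioned in this thread) is a known-good snapshot:

```ini
; /etc/pacman.conf — temporarily pin the repos to an archive snapshot
; (snapshot date is an assumption; pick the last known-good one)
[core]
Server=https://archive.archlinux.org/repos/2024/11/04/$repo/os/$arch

[extra]
Server=https://archive.archlinux.org/repos/2024/11/04/$repo/os/$arch
```

Then a `pacman -Syyuu` would sync the system down to that snapshot; revert the config afterwards to get back on the rolling repos.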

@Chris2000SP commented on GitHub (Nov 14, 2024):

I forgot that I had also exported /usr/lib/ollama into LD_LIBRARY_PATH. Sorry, I just realized that.
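If ollama runs as a systemd service, that environment workaround can be expressed as a drop-in override instead of a shell export. A sketch; the drop-in path and file name are assumptions, not from this thread:

```ini
; /etc/systemd/system/ollama.service.d/override.conf (hypothetical drop-in)
[Service]
Environment=LD_LIBRARY_PATH=/usr/lib/ollama
```

Followed by `systemctl daemon-reload` and `systemctl restart ollama` to apply it.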

@stephensrmmartin commented on GitHub (Nov 14, 2024):

Confirming this: I just did the following downgrades:

```
hipblas 6.0.2-1 <- 6.2.2-1
hsa-rocr 6.0.2-2 <- 6.2.1-1
rccl 6.0.2-1 <- 6.2.2-1
rocalution 6.0.2-2 <- 6.2.2-1
rocblas 6.0.2-1 <- 6.2.2-1
rocfft 6.0.2-1 <- 6.2.2-1
rocm-clang-ocl 6.0.2-1 <- 6.1.2-1
rocm-cmake 6.0.2-1 <- 6.2.2-1
rocm-core 6.0.2-2 <- 6.2.2-1
rocm-device-libs 6.0.2-1 <- 6.2.2-2
rocm-hip-libraries 6.0.2-1 <- 6.2.2-1
rocm-hip-runtime 6.0.2-1 <- 6.2.2-1
rocm-hip-sdk 6.0.2-1 <- 6.2.2-1
rocm-language-runtime 6.0.2-1 <- 6.2.2-1
rocm-llvm 6.0.2-1 <- 6.2.2-2
rocm-opencl-runtime 6.0.2-1 <- 6.2.2-1
rocm-opencl-sdk 6.0.2-1 <- 6.2.2-1
rocm-smi-lib 6.0.2-1 <- 6.2.2-1
rocminfo 6.0.2-1 <- 6.2.2-1
rocrand 6.0.2-1 <- 6.2.2-1
rocsolver 6.0.2-3 <- 6.2.2-1
rocsparse 6.0.2-2 <- 6.2.2-1
rocthrust 6.0.2-1 <- 6.2.2-1
roctracer 6.0.2-1 <- 6.2.2-1
```

(Note the direction of the arrow: the left column is the version downgraded to.)

I also had to manually move `libggml_rocm.so` from `/usr/lib/ollama` to `/tmp/systemd-...ollama-.../tmp/ollama/runners/rocm`.

After that, ollama 0.4.1 appears to run without error.

Edit: Well, not *without* error; the llama 3.2 vision model fails:

```
mllama_model_load: vision using CUDA backend
Nov 14 13:58:00 hwkiller-desktop ollama[68989]: time=2024-11-14T13:58:00.217-08:00 level=DEBUG source=server.go:607 msg="model load progress 1.00"
Nov 14 13:58:00 hwkiller-desktop ollama[68989]: ggml.c:6712: GGML_ASSERT(a->ne[2] == b->ne[2]) failed
Nov 14 13:58:00 hwkiller-desktop ollama[68989]: ptrace: Operation not permitted.
Nov 14 13:58:00 hwkiller-desktop ollama[68989]: No stack.
Nov 14 13:58:00 hwkiller-desktop ollama[68989]: The program is not being run.
Nov 14 13:58:00 hwkiller-desktop ollama[68989]: SIGABRT: abort
```
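The manual library move described above can be sketched like this. The directory name under `/tmp` changes on every service start; the `systemd-private-*` glob is an assumption about the systemd PrivateTmp layout, not a path confirmed in this thread:

```shell
# Print the copy command for each matching runner temp directory.
# Assumption: ollama runs with PrivateTmp, so its /tmp lives under
# /tmp/systemd-private-<boot-id>-ollama.service-<suffix>/.
src=/usr/lib/ollama/libggml_rocm.so
for d in /tmp/systemd-private-*-ollama.service-*/tmp/ollama/runners/rocm*; do
  if [ -d "$d" ]; then
    echo "sudo cp $src $d/"
  fi
done
```

If nothing is printed, the service is not running or the glob does not match your layout.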

@Chris2000SP commented on GitHub (Nov 14, 2024):

Yeah, cool. But I was just guessing which ROCm version might be free of this bug; I didn't dig into it, and I am not enough of a developer to do so.

EDIT: llama 3.2 vision is brand new, so it's no wonder it isn't working yet. I am glad that solar-pro works, because it is the best model to date for my 6800 XT.

@unclemusclez commented on GitHub (Nov 15, 2024):

@dhiltgen
The split between `/usr/local/bin/llama/build/linux-amd64/runners` and `/usr/local/lib/ollama/runners` seems to be at least part of the issue. Is there a `make install` target we are supposed to be using?

If I copy the compiled locations `./ollama/runners` and `./ollama/llama/build/linux-amd64/runners` to `/usr/local/lib/ollama/runners`, it seems to load.

Aside from compiling all of the drivers and manually migrating the runners, it's working on ROCm 6.2.4 on Ubuntu 24.04.
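Spelling out that manual runner migration as a sketch (the source path is from the local build tree mentioned above, and since no `make install` target is confirmed here, this is just the copy step made explicit):

```shell
# Stage the locally built runners where the ollama service expects them.
# The destination follows the /usr/local layout used by the install script;
# the relative build path is an assumption about the checkout's location.
build=llama/build/linux-amd64/runners
dest=/usr/local/lib/ollama/runners
echo "sudo mkdir -p $dest"
echo "sudo cp -r $build/. $dest/"
```

Run the printed commands from the repository root, then restart the service.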

```
 sudo systemctl status ollama
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: active (running) since Fri 2024-11-15 00:38:11 UTC; 2min 26s ago
   Main PID: 40074 (ollama)
      Tasks: 23 (limit: 9830)
     Memory: 1.2G (peak: 1.2G)
        CPU: 1min 901ms
     CGroup: /system.slice/ollama.service
             ├─40074 /usr/local/bin/ollama serve
             └─40204 /usr/local/lib/ollama/runners/rocm_avx/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-a39d684d796a7257d3d75c6c360d6ce85d9219156441f3b6f4d77b2b91ca9147 --ctx-siz>

Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: freq_base  = 1000000.0
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: freq_scale = 1
Nov 15 00:40:36 kamala ollama[40074]: llama_kv_cache_init:      ROCm0 KV buffer size =  2048.00 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model:  ROCm_Host  output buffer size =     2.40 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model:      ROCm0 compute buffer size =   696.00 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model:  ROCm_Host compute buffer size =    26.01 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: graph nodes  = 2246
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: graph splits = 2
Nov 15 00:40:36 kamala ollama[40074]: time=2024-11-15T00:40:36.556Z level=INFO source=server.go:601 msg="llama runner started in 127.15 seconds"
```
```
 AMDGPU_TARGETS="gfx906" make -j16
GOARCH=amd64 go build -buildmode=pie "-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=0.4.1-15-g413fb24\" \"-X=github.com/ollama/ollama/llama.CpuFeatures=\" " -trimpath   -o build/linux-amd64/runners/cpu/ollama_llama_server ./runner
GOARCH=amd64 go build -buildmode=pie "-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=0.4.1-15-g413fb24\" \"-X=github.com/ollama/ollama/llama.CpuFeatures="avx"\" " -trimpath -tags "avx" -o build/linux-amd64/runners/cpu_avx/ollama_llama_server ./runner
GOARCH=amd64 go build -buildmode=pie "-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=0.4.1-15-g413fb24\" \"-X=github.com/ollama/ollama/llama.CpuFeatures="avx,avx2"\" " -trimpath -tags "avx,avx2" -o build/linux-amd64/runners/cpu_avx2/ollama_llama_server ./runner
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda.rocm.o ggml-cuda.cu
[the same hipcc invocation is repeated with identical flags for each remaining ggml-cuda/*.cu source: acc.cu, arange.cu, argsort.cu, binbcast.cu, clamp.cu, concat.cu, conv-transpose-1d.cu, convert.cu, cpy.cu, cross-entropy-loss.cu, diagmask.cu, dmmv.cu, im2col.cu, getrows.cu, mmq.cu, mmvq.cu, norm.cu, opt-step-adamw.cu, out-prod.cu, pad.cu, pool2d.cu, quantize.cu, rope.cu, rwkv-wkv.cu]
```
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/scale.rocm.o ggml-cuda/scale.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/softmax.rocm.o ggml-cuda/softmax.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/sum.rocm.o ggml-cuda/sum.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/sumrows.rocm.o ggml-cuda/sumrows.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/tsembd.rocm.o ggml-cuda/tsembd.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/unary.rocm.o ggml-cuda/unary.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/upscale.rocm.o ggml-cuda/upscale.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq1_s.rocm.o ggml-cuda/template-instances/mmq-instance-iq1_s.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_s.rocm.o ggml-cuda/template-instances/mmq-instance-iq2_s.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_xs.rocm.o ggml-cuda/template-instances/mmq-instance-iq2_xs.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_xxs.rocm.o ggml-cuda/template-instances/mmq-instance-iq2_xxs.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq3_s.rocm.o ggml-cuda/template-instances/mmq-instance-iq3_s.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq3_xxs.rocm.o ggml-cuda/template-instances/mmq-instance-iq3_xxs.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq4_nl.rocm.o ggml-cuda/template-instances/mmq-instance-iq4_nl.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq4_xs.rocm.o ggml-cuda/template-instances/mmq-instance-iq4_xs.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q2_k.rocm.o ggml-cuda/template-instances/mmq-instance-q2_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q3_k.rocm.o ggml-cuda/template-instances/mmq-instance-q3_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_0.rocm.o ggml-cuda/template-instances/mmq-instance-q4_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_1.rocm.o ggml-cuda/template-instances/mmq-instance-q4_1.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_k.rocm.o ggml-cuda/template-instances/mmq-instance-q4_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_0.rocm.o ggml-cuda/template-instances/mmq-instance-q5_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_1.rocm.o ggml-cuda/template-instances/mmq-instance-q5_1.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_k.rocm.o ggml-cuda/template-instances/mmq-instance-q5_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q6_k.rocm.o ggml-cuda/template-instances/mmq-instance-q6_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q8_0.rocm.o ggml-cuda/template-instances/mmq-instance-q8_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml.rocm.o ggml.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml-backend.rocm.o ggml-backend.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml-alloc.rocm.o ggml-alloc.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml-quants.rocm.o ggml-quants.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -o build/linux-amd64/sgemm.rocm.o sgemm.cpp
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml-aarch64.rocm.o ggml-aarch64.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/fattn-tile-f16.rocm.o ggml-cuda/fattn-tile-f16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/fattn-tile-f32.rocm.o ggml-cuda/fattn-tile-f32.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/fattn.rocm.o ggml-cuda/fattn.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-f16-f16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-f16-f16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs64-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs64-f16-f16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-f16-f16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-f16-f16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c  -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs64-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs64-f16-f16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc --shared -L/opt/rocm/lib -lamdhip64 -L../dist/linux-amd64/lib/ollama  -lhipblas  -lrocblas ./build/linux-amd64/ggml-cuda.rocm.o ./build/linux-amd64/ggml-cuda/acc.rocm.o ./build/linux-amd64/ggml-cuda/arange.rocm.o ./build/linux-amd64/ggml-cuda/argsort.rocm.o ./build/linux-amd64/ggml-cuda/binbcast.rocm.o ./build/linux-amd64/ggml-cuda/clamp.rocm.o ./build/linux-amd64/ggml-cuda/concat.rocm.o ./build/linux-amd64/ggml-cuda/conv-transpose-1d.rocm.o ./build/linux-amd64/ggml-cuda/convert.rocm.o ./build/linux-amd64/ggml-cuda/cpy.rocm.o ./build/linux-amd64/ggml-cuda/cross-entropy-loss.rocm.o ./build/linux-amd64/ggml-cuda/diagmask.rocm.o ./build/linux-amd64/ggml-cuda/dmmv.rocm.o ./build/linux-amd64/ggml-cuda/getrows.rocm.o ./build/linux-amd64/ggml-cuda/im2col.rocm.o ./build/linux-amd64/ggml-cuda/mmq.rocm.o ./build/linux-amd64/ggml-cuda/mmvq.rocm.o ./build/linux-amd64/ggml-cuda/norm.rocm.o ./build/linux-amd64/ggml-cuda/opt-step-adamw.rocm.o ./build/linux-amd64/ggml-cuda/out-prod.rocm.o ./build/linux-amd64/ggml-cuda/pad.rocm.o ./build/linux-amd64/ggml-cuda/pool2d.rocm.o ./build/linux-amd64/ggml-cuda/quantize.rocm.o ./build/linux-amd64/ggml-cuda/rope.rocm.o ./build/linux-amd64/ggml-cuda/rwkv-wkv.rocm.o ./build/linux-amd64/ggml-cuda/scale.rocm.o ./build/linux-amd64/ggml-cuda/softmax.rocm.o ./build/linux-amd64/ggml-cuda/sum.rocm.o ./build/linux-amd64/ggml-cuda/sumrows.rocm.o ./build/linux-amd64/ggml-cuda/tsembd.rocm.o ./build/linux-amd64/ggml-cuda/unary.rocm.o ./build/linux-amd64/ggml-cuda/upscale.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq1_s.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_s.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_xs.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_xxs.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq3_s.rocm.o 
./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq3_xxs.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq4_nl.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq4_xs.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q2_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q3_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_1.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_1.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q6_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q8_0.rocm.o ./build/linux-amd64/ggml.rocm.o ./build/linux-amd64/ggml-backend.rocm.o ./build/linux-amd64/ggml-alloc.rocm.o ./build/linux-amd64/ggml-quants.rocm.o ./build/linux-amd64/sgemm.rocm.o ./build/linux-amd64/ggml-aarch64.rocm.o ./build/linux-amd64/ggml-cuda/fattn-tile-f16.rocm.o ./build/linux-amd64/ggml-cuda/fattn-tile-f32.rocm.o ./build/linux-amd64/ggml-cuda/fattn.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.rocm.o 
./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs64-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs64-f16-f16.rocm.o -o build/linux-amd64/runners/rocm_avx/libggml_rocm.so
GOARCH=amd64 CGO_LDFLAGS="-L"./build/linux-amd64/runners/rocm_avx/" " go build -buildmode=pie "-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=0.4.1-15-g413fb24\" \"-X=github.com/ollama/ollama/llama.CpuFeatures=avx\" " -trimpath -tags avx,rocm -o build/linux-amd64/runners/rocm_avx/ollama_llama_server ./runner
make[2]: Nothing to be done for 'exe'.
# github.com/ollama/ollama/llama
ggml.c: In function ‘ggml_vec_mad_f16’:
ggml.c:2378:45: warning: passing argument 1 of ‘__avx_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 2378 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                                             ^
ggml.c:1458:51: note: in definition of macro ‘GGML_F32Cx8_LOAD’
 1458 | #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
      |                                                   ^
ggml.c:2378:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
 2378 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                     ^~~~~~~~~~~~~~~~~
ggml.c:1441:53: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_fp16_t *’ {aka ‘const short unsigned int *’}
 1441 | static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
      |                                        ~~~~~~~~~~~~~^
# github.com/ollama/ollama/llama
ggml-aarch64.c: In function ‘ggml_gemv_q4_0_8x8_q8_0’:
ggml-aarch64.c:978:81: warning: passing argument 1 of ‘__avx_rearranged_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
  978 |                 const __m256 col_scale_f32 = GGML_F32Cx8_REARRANGE_LOAD(b_ptr[b].d, changemask);
      |                                                                         ~~~~~~~~^~
ggml-aarch64.c:142:85: note: in definition of macro ‘GGML_F32Cx8_REARRANGE_LOAD’
  142 | #define GGML_F32Cx8_REARRANGE_LOAD(x, arrangeMask)     __avx_rearranged_f32cx8_load(x, arrangeMask)
      |                                                                                     ^
ggml-aarch64.c:128:64: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  128 | static inline __m256 __avx_rearranged_f32cx8_load(ggml_fp16_t *x, __m128i arrangeMask) {
      |                                                   ~~~~~~~~~~~~~^
ggml-aarch64.c: In function ‘ggml_gemm_q4_0_8x8_q8_0’:
ggml-aarch64.c:2943:75: warning: passing argument 1 of ‘__avx_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 2943 |                     const __m256 col_scale_f32 = GGML_F32Cx8_LOAD(b_ptr[b].d);
      |                                                                   ~~~~~~~~^~
ggml-aarch64.c:140:51: note: in definition of macro ‘GGML_F32Cx8_LOAD’
  140 | #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
      |                                                   ^
ggml-aarch64.c:109:53: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  109 | static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
      |                                        ~~~~~~~~~~~~~^
ggml-aarch64.c:3020:91: warning: passing argument 1 of ‘__avx_repeat_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 3020 |                         const __m256 row_scale_f32 = GGML_F32Cx8_REPEAT_LOAD(a_ptrs[rp][b].d, loadMask);
      |                                                                              ~~~~~~~~~~~~~^~
ggml-aarch64.c:141:75: note: in definition of macro ‘GGML_F32Cx8_REPEAT_LOAD’
  141 | #define GGML_F32Cx8_REPEAT_LOAD(x, loadMask)     __avx_repeat_f32cx8_load(x)
      |                                                                           ^
ggml-aarch64.c:118:60: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  118 | static inline __m256 __avx_repeat_f32cx8_load(ggml_fp16_t *x) {
      |                                               ~~~~~~~~~~~~~^
ggml-aarch64.c:3107:75: warning: passing argument 1 of ‘__avx_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 3107 |                     const __m256 col_scale_f32 = GGML_F32Cx8_LOAD(b_ptr[b].d);
      |                                                                   ~~~~~~~~^~
ggml-aarch64.c:140:51: note: in definition of macro ‘GGML_F32Cx8_LOAD’
  140 | #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
      |                                                   ^
ggml-aarch64.c:109:53: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  109 | static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
      |                                        ~~~~~~~~~~~~~^
ggml-aarch64.c:3185:82: warning: passing argument 1 of ‘__avx_repeat_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 3185 |                     const __m256 row_scale_f32 = GGML_F32Cx8_REPEAT_LOAD(a_ptr[b].d, loadMask);
      |                                                                          ~~~~~~~~^~
ggml-aarch64.c:141:75: note: in definition of macro ‘GGML_F32Cx8_REPEAT_LOAD’
  141 | #define GGML_F32Cx8_REPEAT_LOAD(x, loadMask)     __avx_repeat_f32cx8_load(x)
      |                                                                           ^
ggml-aarch64.c:118:60: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  118 | static inline __m256 __avx_repeat_f32cx8_load(ggml_fp16_t *x) {
      |                                               ~~~~~~~~~~~~~^
# github.com/ollama/ollama/llama
ggml.c: In function ‘ggml_vec_mad_f16’:
ggml.c:2378:45: warning: passing argument 1 of ‘__avx_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 2378 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                                             ^
ggml.c:1458:51: note: in definition of macro ‘GGML_F32Cx8_LOAD’
 1458 | #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
      |                                                   ^
ggml.c:2378:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
 2378 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                     ^~~~~~~~~~~~~~~~~
ggml.c:1441:53: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_fp16_t *’ {aka ‘const short unsigned int *’}
 1441 | static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
      |                                        ~~~~~~~~~~~~~^

This produces the error:

Nov 15 00:16:07 kamala ollama[31786]: [GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Nov 15 00:16:07 kamala ollama[31786]: [GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
Nov 15 00:16:07 kamala ollama[31786]: [GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Nov 15 00:16:07 kamala ollama[31786]: [GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Nov 15 00:16:07 kamala ollama[31786]: time=2024-11-15T00:16:07.508Z level=INFO source=routes.go:1240 msg="Listening on [::]:11434 (version 0.0.0)"
Nov 15 00:16:07 kamala ollama[31786]: time=2024-11-15T00:16:07.508Z level=ERROR source=common.go:276 msg="empty runner dir"
Nov 15 00:16:07 kamala ollama[31786]: time=2024-11-15T00:16:07.508Z level=INFO source=common.go:50 msg="Dynamic LLM libraries" runners=[]
Nov 15 00:16:07 kamala ollama[31786]: Error: unable to initialize llm runners unable to locate runners in any search path [/usr/local/bin/llama/build/linux-amd64/runners /usr/local/lib/ollama/runners]
Nov 15 00:16:07 kamala systemd[1]: ollama.service: Main process exited, code=exited, status=1/FAILURE
Nov 15 00:16:07 kamala systemd[1]: ollama.service: Failed with result 'exit-code'.

But if I run it from the local build folder, it works:

>>>
>>> test
Hello! How can I assist you today? If you have any questions or need help with something specific, feel free to let me know.

>>> Send a message (/? for help)
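The "unable to locate runners in any search path" failure above is just a lookup over a fixed list of candidate directories. As a hedged illustration only (the function name and demo paths are made up for this sketch, not Ollama's actual code), the behavior amounts to:

```shell
# Hedged sketch of the runner lookup described by the error message:
# try each candidate directory in order, and fail if none exists.
find_runners() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  echo "unable to locate runners in any search path: $*" >&2
  return 1
}

# Demo with a temporary directory standing in for /usr/local/lib/ollama/runners:
tmp=$(mktemp -d)
mkdir -p "$tmp/lib/ollama/runners"
find_runners "$tmp/bin/llama/build/linux-amd64/runners" "$tmp/lib/ollama/runners"
rm -rf "$tmp"
```

This is why staging the compiled `runners` tree into one of the probed directories (e.g. `/usr/local/lib/ollama/runners`) lets the service start.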
<!-- gh-comment-id:2477708328 --> @unclemusclez commented on GitHub (Nov 15, 2024):

@dhiltgen `/usr/local/bin/llama/build/linux-amd64/runners` and `/usr/local/lib/ollama/runners` seem to be at least part of the issue. Is there a `make install` we are supposed to be using? If I copy the compiled locations `./ollama/runners` and `./ollama/llama/build/linux-amd64/runners` to `/usr/local/lib/ollama/runners`, it seems to load. Aside from the compilation of all of the drivers and the manual migration of the runners, it's working with ROCm 6.2.4 on Ubuntu 24.04.

```
sudo systemctl status ollama
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: active (running) since Fri 2024-11-15 00:38:11 UTC; 2min 26s ago
   Main PID: 40074 (ollama)
      Tasks: 23 (limit: 9830)
     Memory: 1.2G (peak: 1.2G)
        CPU: 1min 901ms
     CGroup: /system.slice/ollama.service
             ├─40074 /usr/local/bin/ollama serve
             └─40204 /usr/local/lib/ollama/runners/rocm_avx/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-a39d684d796a7257d3d75c6c360d6ce85d9219156441f3b6f4d77b2b91ca9147 --ctx-siz>

Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: freq_base = 1000000.0
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: freq_scale = 1
Nov 15 00:40:36 kamala ollama[40074]: llama_kv_cache_init: ROCm0 KV buffer size = 2048.00 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: ROCm_Host output buffer size = 2.40 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: ROCm0 compute buffer size = 696.00 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: ROCm_Host compute buffer size = 26.01 MiB
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: graph nodes = 2246
Nov 15 00:40:36 kamala ollama[40074]: llama_new_context_with_model: graph splits = 2
Nov 15 00:40:36 kamala ollama[40074]: time=2024-11-15T00:40:36.556Z level=INFO source=server.go:601 msg="llama runner started in 127.15 seconds"
```

```bash
AMDGPU_TARGETS="gfx906" make -j16
GOARCH=amd64 go build -buildmode=pie "-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=0.4.1-15-g413fb24\" \"-X=github.com/ollama/ollama/llama.CpuFeatures=\" " -trimpath -o build/linux-amd64/runners/cpu/ollama_llama_server ./runner
GOARCH=amd64 go build -buildmode=pie "-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=0.4.1-15-g413fb24\" \"-X=github.com/ollama/ollama/llama.CpuFeatures="avx"\" " -trimpath -tags "avx" -o build/linux-amd64/runners/cpu_avx/ollama_llama_server ./runner
GOARCH=amd64 go build -buildmode=pie "-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=0.4.1-15-g413fb24\" \"-X=github.com/ollama/ollama/llama.CpuFeatures="avx,avx2"\" " -trimpath -tags "avx,avx2" -o build/linux-amd64/runners/cpu_avx2/ollama_llama_server ./runner
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I.
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda.rocm.o ggml-cuda.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/acc.rocm.o ggml-cuda/acc.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/arange.rocm.o ggml-cuda/arange.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/argsort.rocm.o ggml-cuda/argsort.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/binbcast.rocm.o ggml-cuda/binbcast.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/clamp.rocm.o ggml-cuda/clamp.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/concat.rocm.o ggml-cuda/concat.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/conv-transpose-1d.rocm.o ggml-cuda/conv-transpose-1d.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/convert.rocm.o ggml-cuda/convert.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/cpy.rocm.o ggml-cuda/cpy.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/cross-entropy-loss.rocm.o ggml-cuda/cross-entropy-loss.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/diagmask.rocm.o ggml-cuda/diagmask.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/dmmv.rocm.o ggml-cuda/dmmv.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/im2col.rocm.o ggml-cuda/im2col.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/getrows.rocm.o ggml-cuda/getrows.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/mmq.rocm.o ggml-cuda/mmq.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/mmvq.rocm.o ggml-cuda/mmvq.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/norm.rocm.o ggml-cuda/norm.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/opt-step-adamw.rocm.o ggml-cuda/opt-step-adamw.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/out-prod.rocm.o ggml-cuda/out-prod.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/pad.rocm.o ggml-cuda/pad.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/pool2d.rocm.o ggml-cuda/pool2d.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/quantize.rocm.o ggml-cuda/quantize.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/rope.rocm.o ggml-cuda/rope.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/rwkv-wkv.rocm.o ggml-cuda/rwkv-wkv.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/scale.rocm.o ggml-cuda/scale.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/softmax.rocm.o ggml-cuda/softmax.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/sum.rocm.o ggml-cuda/sum.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/sumrows.rocm.o ggml-cuda/sumrows.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/tsembd.rocm.o ggml-cuda/tsembd.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/unary.rocm.o ggml-cuda/unary.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/upscale.rocm.o ggml-cuda/upscale.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq1_s.rocm.o ggml-cuda/template-instances/mmq-instance-iq1_s.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_s.rocm.o ggml-cuda/template-instances/mmq-instance-iq2_s.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_xs.rocm.o ggml-cuda/template-instances/mmq-instance-iq2_xs.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_xxs.rocm.o ggml-cuda/template-instances/mmq-instance-iq2_xxs.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq3_s.rocm.o ggml-cuda/template-instances/mmq-instance-iq3_s.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq3_xxs.rocm.o ggml-cuda/template-instances/mmq-instance-iq3_xxs.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq4_nl.rocm.o ggml-cuda/template-instances/mmq-instance-iq4_nl.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq4_xs.rocm.o ggml-cuda/template-instances/mmq-instance-iq4_xs.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q2_k.rocm.o ggml-cuda/template-instances/mmq-instance-q2_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q3_k.rocm.o ggml-cuda/template-instances/mmq-instance-q3_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_0.rocm.o ggml-cuda/template-instances/mmq-instance-q4_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_1.rocm.o ggml-cuda/template-instances/mmq-instance-q4_1.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_k.rocm.o ggml-cuda/template-instances/mmq-instance-q4_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_0.rocm.o ggml-cuda/template-instances/mmq-instance-q5_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_1.rocm.o ggml-cuda/template-instances/mmq-instance-q5_1.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_k.rocm.o ggml-cuda/template-instances/mmq-instance-q5_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q6_k.rocm.o ggml-cuda/template-instances/mmq-instance-q6_k.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q8_0.rocm.o ggml-cuda/template-instances/mmq-instance-q8_0.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml.rocm.o ggml.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml-backend.rocm.o ggml-backend.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml-alloc.rocm.o ggml-alloc.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml-quants.rocm.o ggml-quants.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -o build/linux-amd64/sgemm.rocm.o sgemm.cpp
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -o build/linux-amd64/ggml-aarch64.rocm.o ggml-aarch64.c
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/fattn-tile-f16.rocm.o ggml-cuda/fattn-tile-f16.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/fattn-tile-f32.rocm.o ggml-cuda/fattn-tile-f32.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. --offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/fattn.rocm.o ggml-cuda/fattn.cu
/usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.rocm.o ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-f16-f16.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-f16-f16.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs64-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f16-instance-hs64-f16-f16.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-f16-f16.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-f16-f16.cu /usr/bin/ccache /opt/rocm/bin/hipcc -c -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=gnu++11 -mavx -mf16c -mfma -parallel-jobs=2 -c -O3 -DGGML_USE_CUDA -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_HIPBLAS -DGGML_USE_LLAMAFILE -DHIP_FAST_MATH -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -DNDEBUG -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -DUSE_PROF_API=1 -std=gnu++14 -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -Wno-expansion-to-defined -Wno-invalid-noreturn -Wno-ignored-attributes -Wno-pass-failed -Wno-deprecated-declarations -Wno-unused-result -I. 
--offload-arch=gfx900 --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1010 --offload-arch=gfx1012 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -o build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs64-f16-f16.rocm.o ggml-cuda/template-instances/fattn-vec-f32-instance-hs64-f16-f16.cu /usr/bin/ccache /opt/rocm/bin/hipcc --shared -L/opt/rocm/lib -lamdhip64 -L../dist/linux-amd64/lib/ollama -lhipblas -lrocblas ./build/linux-amd64/ggml-cuda.rocm.o ./build/linux-amd64/ggml-cuda/acc.rocm.o ./build/linux-amd64/ggml-cuda/arange.rocm.o ./build/linux-amd64/ggml-cuda/argsort.rocm.o ./build/linux-amd64/ggml-cuda/binbcast.rocm.o ./build/linux-amd64/ggml-cuda/clamp.rocm.o ./build/linux-amd64/ggml-cuda/concat.rocm.o ./build/linux-amd64/ggml-cuda/conv-transpose-1d.rocm.o ./build/linux-amd64/ggml-cuda/convert.rocm.o ./build/linux-amd64/ggml-cuda/cpy.rocm.o ./build/linux-amd64/ggml-cuda/cross-entropy-loss.rocm.o ./build/linux-amd64/ggml-cuda/diagmask.rocm.o ./build/linux-amd64/ggml-cuda/dmmv.rocm.o ./build/linux-amd64/ggml-cuda/getrows.rocm.o ./build/linux-amd64/ggml-cuda/im2col.rocm.o ./build/linux-amd64/ggml-cuda/mmq.rocm.o ./build/linux-amd64/ggml-cuda/mmvq.rocm.o ./build/linux-amd64/ggml-cuda/norm.rocm.o ./build/linux-amd64/ggml-cuda/opt-step-adamw.rocm.o ./build/linux-amd64/ggml-cuda/out-prod.rocm.o ./build/linux-amd64/ggml-cuda/pad.rocm.o ./build/linux-amd64/ggml-cuda/pool2d.rocm.o ./build/linux-amd64/ggml-cuda/quantize.rocm.o ./build/linux-amd64/ggml-cuda/rope.rocm.o ./build/linux-amd64/ggml-cuda/rwkv-wkv.rocm.o ./build/linux-amd64/ggml-cuda/scale.rocm.o ./build/linux-amd64/ggml-cuda/softmax.rocm.o ./build/linux-amd64/ggml-cuda/sum.rocm.o ./build/linux-amd64/ggml-cuda/sumrows.rocm.o ./build/linux-amd64/ggml-cuda/tsembd.rocm.o 
./build/linux-amd64/ggml-cuda/unary.rocm.o ./build/linux-amd64/ggml-cuda/upscale.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq1_s.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_s.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_xs.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq2_xxs.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq3_s.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq3_xxs.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq4_nl.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-iq4_xs.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q2_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q3_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_1.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q4_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_1.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q5_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q6_k.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/mmq-instance-q8_0.rocm.o ./build/linux-amd64/ggml.rocm.o ./build/linux-amd64/ggml-backend.rocm.o ./build/linux-amd64/ggml-alloc.rocm.o ./build/linux-amd64/ggml-quants.rocm.o ./build/linux-amd64/sgemm.rocm.o ./build/linux-amd64/ggml-aarch64.rocm.o ./build/linux-amd64/ggml-cuda/fattn-tile-f16.rocm.o ./build/linux-amd64/ggml-cuda/fattn-tile-f32.rocm.o ./build/linux-amd64/ggml-cuda/fattn.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.rocm.o 
./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f16-instance-hs64-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-f16-f16.rocm.o ./build/linux-amd64/ggml-cuda/template-instances/fattn-vec-f32-instance-hs64-f16-f16.rocm.o -o build/linux-amd64/runners/rocm_avx/libggml_rocm.so GOARCH=amd64 CGO_LDFLAGS="-L"./build/linux-amd64/runners/rocm_avx/" " go build -buildmode=pie "-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=0.4.1-15-g413fb24\" \"-X=github.com/ollama/ollama/llama.CpuFeatures=avx\" " -trimpath -tags avx,rocm -o build/linux-amd64/runners/rocm_avx/ollama_llama_server ./runner make[2]: Nothing to be done for 'exe'. make[2]: Nothing to be done for 'exe'. 
# github.com/ollama/ollama/llama
ggml.c: In function ‘ggml_vec_mad_f16’:
ggml.c:2378:45: warning: passing argument 1 of ‘__avx_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 2378 |                 ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                                           ^
ggml.c:1458:51: note: in definition of macro ‘GGML_F32Cx8_LOAD’
 1458 | #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
      |                                                   ^
ggml.c:2378:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
 2378 |                 ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                         ^~~~~~~~~~~~~~~~~
ggml.c:1441:53: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_fp16_t *’ {aka ‘const short unsigned int *’}
 1441 | static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
      |                                        ~~~~~~~~~~~~~^
# github.com/ollama/ollama/llama
ggml-aarch64.c: In function ‘ggml_gemv_q4_0_8x8_q8_0’:
ggml-aarch64.c:978:81: warning: passing argument 1 of ‘__avx_rearranged_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
  978 |             const __m256 col_scale_f32 = GGML_F32Cx8_REARRANGE_LOAD(b_ptr[b].d, changemask);
      |                                                                     ~~~~~~~~^~
ggml-aarch64.c:142:85: note: in definition of macro ‘GGML_F32Cx8_REARRANGE_LOAD’
  142 | #define GGML_F32Cx8_REARRANGE_LOAD(x, arrangeMask) __avx_rearranged_f32cx8_load(x, arrangeMask)
      |                                                                                 ^
ggml-aarch64.c:128:64: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  128 | static inline __m256 __avx_rearranged_f32cx8_load(ggml_fp16_t *x, __m128i arrangeMask) {
      |                                                   ~~~~~~~~~~~~~^
ggml-aarch64.c: In function ‘ggml_gemm_q4_0_8x8_q8_0’:
ggml-aarch64.c:2943:75: warning: passing argument 1 of ‘__avx_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 2943 |             const __m256 col_scale_f32 = GGML_F32Cx8_LOAD(b_ptr[b].d);
      |                                                           ~~~~~~~~^~
ggml-aarch64.c:140:51: note: in definition of macro ‘GGML_F32Cx8_LOAD’
  140 | #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
      |                                                   ^
ggml-aarch64.c:109:53: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  109 | static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
      |                                        ~~~~~~~~~~~~~^
ggml-aarch64.c:3020:91: warning: passing argument 1 of ‘__avx_repeat_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 3020 |                 const __m256 row_scale_f32 = GGML_F32Cx8_REPEAT_LOAD(a_ptrs[rp][b].d, loadMask);
      |                                                                      ~~~~~~~~~~~~~^~
ggml-aarch64.c:141:75: note: in definition of macro ‘GGML_F32Cx8_REPEAT_LOAD’
  141 | #define GGML_F32Cx8_REPEAT_LOAD(x, loadMask)     __avx_repeat_f32cx8_load(x)
      |                                                                           ^
ggml-aarch64.c:118:60: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  118 | static inline __m256 __avx_repeat_f32cx8_load(ggml_fp16_t *x) {
      |                                               ~~~~~~~~~~~~~^
ggml-aarch64.c:3107:75: warning: passing argument 1 of ‘__avx_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 3107 |             const __m256 col_scale_f32 = GGML_F32Cx8_LOAD(b_ptr[b].d);
      |                                                           ~~~~~~~~^~
ggml-aarch64.c:140:51: note: in definition of macro ‘GGML_F32Cx8_LOAD’
  140 | #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
      |                                                   ^
ggml-aarch64.c:109:53: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  109 | static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
      |                                        ~~~~~~~~~~~~~^
ggml-aarch64.c:3185:82: warning: passing argument 1 of ‘__avx_repeat_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 3185 |             const __m256 row_scale_f32 = GGML_F32Cx8_REPEAT_LOAD(a_ptr[b].d, loadMask);
      |                                                                  ~~~~~~~~^~
ggml-aarch64.c:141:75: note: in definition of macro ‘GGML_F32Cx8_REPEAT_LOAD’
  141 | #define GGML_F32Cx8_REPEAT_LOAD(x, loadMask)     __avx_repeat_f32cx8_load(x)
      |                                                                           ^
ggml-aarch64.c:118:60: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_half *’ {aka ‘const short unsigned int *’}
  118 | static inline __m256 __avx_repeat_f32cx8_load(ggml_fp16_t *x) {
      |                                               ~~~~~~~~~~~~~^
# github.com/ollama/ollama/llama
ggml.c: In function ‘ggml_vec_mad_f16’:
ggml.c:2378:45: warning: passing argument 1 of ‘__avx_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 2378 |                 ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                                           ^
ggml.c:1458:51: note: in definition of macro ‘GGML_F32Cx8_LOAD’
 1458 | #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
      |                                                   ^
ggml.c:2378:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
 2378 |                 ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                         ^~~~~~~~~~~~~~~~~
ggml.c:1441:53: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_fp16_t *’ {aka ‘const short unsigned int *’}
 1441 | static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
      |                                        ~~~~~~~~~~~~~^
# github.com/ollama/ollama/llama
ggml.c: In function ‘ggml_vec_mad_f16’:
ggml.c:2378:45: warning: passing argument 1 of ‘__avx_f32cx8_load’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 2378 |                 ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                                           ^
ggml.c:1458:51: note: in definition of macro ‘GGML_F32Cx8_LOAD’
 1458 | #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
      |                                                   ^
ggml.c:2378:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
 2378 |                 ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
      |                         ^~~~~~~~~~~~~~~~~
ggml.c:1441:53: note: expected ‘ggml_fp16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘const ggml_fp16_t *’ {aka ‘const short unsigned int *’}
 1441 | static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
      |                                        ~~~~~~~~~~~~~^
```

This produces the error:

```
Nov 15 00:16:07 kamala ollama[31786]: [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Nov 15 00:16:07 kamala ollama[31786]: [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
Nov 15 00:16:07 kamala ollama[31786]: [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Nov 15 00:16:07 kamala ollama[31786]: [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Nov 15 00:16:07 kamala ollama[31786]: time=2024-11-15T00:16:07.508Z level=INFO source=routes.go:1240 msg="Listening on [::]:11434 (version 0.0.0)"
Nov 15 00:16:07 kamala ollama[31786]: time=2024-11-15T00:16:07.508Z level=ERROR source=common.go:276 msg="empty runner dir"
Nov 15 00:16:07 kamala ollama[31786]: time=2024-11-15T00:16:07.508Z level=INFO source=common.go:50 msg="Dynamic LLM libraries" runners=[]
Nov 15 00:16:07 kamala ollama[31786]: Error: unable to initialize llm runners unable to locate runners in any search path [/usr/local/bin/llama/build/linux-amd64/runners /usr/local/lib/ollama/runners]
Nov 15 00:16:07 kamala systemd[1]: ollama.service: Main process exited, code=exited, status=1/FAILURE
Nov 15 00:16:07 kamala systemd[1]: ollama.service: Failed with result 'exit-code'.
```

But if I run it from the local build folder, it works:

```
⠙ >>> >>> test
Hello! How can I assist you today? If you have any questions or need help with something specific, feel free to let me know.

>>> Send a message (/? for help)
```
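The "unable to locate runners" failure above is a path problem: the server only searches the two directories printed in the log. A minimal staging sketch of a fix, under the assumption that the build output lives in `build/linux-amd64/runners` as the build log suggests (here `DEST` and the stub runner files are scratch stand-ins so the commands can run unprivileged; a real install would target `/` as root):

```shell
#!/bin/sh
# Sketch: stage locally built runners into one of the search paths the
# server reports (/usr/local/lib/ollama/runners, per the log above).
set -e
work=$(mktemp -d) && cd "$work"

# Stand-in for the real build output tree (hypothetical contents).
mkdir -p build/linux-amd64/runners/rocm_avx
touch build/linux-amd64/runners/rocm_avx/ollama_llama_server

# Scratch destination; set DEST=/ and run as root for a real install.
DEST=$(mktemp -d)
mkdir -p "$DEST/usr/local/lib/ollama"
cp -r build/linux-amd64/runners "$DEST/usr/local/lib/ollama/runners"
```

After staging, restarting `ollama.service` should let the server find the runner directory instead of reporting `runners=[]`.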

@rainbyte commented on GitHub (Nov 16, 2024):

I just installed these updates:

- rocblas 6.2.2-2
- ollama-rocm 0.4.2-1

And ran ollama with this command:

```sh
ollama serve
```

Then I got the same error as before:

```
/tmp/ollama3240664242/runners/rocm/ollama_llama_server: error while loading shared libraries: libggml_rocm.so: cannot open shared object file: No such file or directory
time=2024-11-16T03:30:42.832-03:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: exit status 127"
```

To make it work I had to do this:

```sh
cp /usr/lib/ollama/libggml_rocm.so /tmp/ollama1482905245/runners/rocm/
```

EDIT:

I'm using an RX 7900 XTX GPU.

@zw963 commented on GitHub (Nov 16, 2024):

> I just installed these updates:
>
> - rocblas 6.2.2-2
> - ollama-rocm 0.4.2-1
>
> And run ollama with this command:
>
> ```sh
> ollama serve
> ```
>
> Then I got this error as before:
>
> ```
> /tmp/ollama3240664242/runners/rocm/ollama_llama_server: error while loading shared libraries: libggml_rocm.so: cannot open shared object file: No such file or directory
> time=2024-11-16T03:30:42.832-03:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: exit status 127"
> ```
>
> To make it work I had to do this:
>
> ```sh
> cp /usr/lib/ollama/libggml_rocm.so /tmp/ollama1482905245/runners/rocm/
> ```
>
> EDIT:
>
> I'm using RX 7900 XTX gpu

Cool, thank you, it works for me (Radeon 780M). The only issue is that the tmp folder name `ollama1482905245` is random on each start, so I had to write a script to find it and copy `libggml_rocm.so` there.
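The find-and-copy step mentioned above can be sketched as a small script. This is a hypothetical helper (the function name is my own), assuming the paths discussed in this thread: the Arch package's `/usr/lib/ollama/libggml_rocm.so` and the randomly-suffixed `/tmp/ollama*/runners/rocm` directory that `ollama serve` extracts.

```sh
#!/bin/sh
# Hypothetical helper: copy the installed libggml_rocm.so into the
# randomly-named runner directory that `ollama serve` extracts under /tmp.
# Defaults follow the Arch paths discussed in this thread; both arguments
# are optional overrides.
copy_rocm_lib() {
    lib="${1:-/usr/lib/ollama/libggml_rocm.so}"
    tmp_root="${2:-/tmp}"
    # The numeric suffix changes on every start, so pick the newest match.
    dest="$(ls -dt "$tmp_root"/ollama*/runners/rocm 2>/dev/null | head -n1)"
    if [ -z "$dest" ]; then
        echo "no extracted runner directory found under $tmp_root" >&2
        return 1
    fi
    cp "$lib" "$dest/" && echo "copied $lib to $dest"
}

# Example (default paths; run after `ollama serve` has started):
# copy_rocm_lib
```

Run it after starting `ollama serve`, since the `/tmp` directory only exists once the server has extracted its runners.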


@rainbyte commented on GitHub (Nov 16, 2024):

@zw963 exactly that.

It seems each time you launch ollama it creates a different folder inside `/tmp`.

Maybe it should copy both files? Or add `/usr/lib/ollama` to the library path.


EDIT:
I just tried this command to add the library path:

```sh
LD_LIBRARY_PATH=/usr/lib/ollama/ ollama serve
```

And it works without copying the `.so` file.
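To make the `LD_LIBRARY_PATH` workaround survive service restarts, one option (a sketch, not an official fix) is a systemd drop-in for `ollama.service`, which is the unit name shown in the journal output earlier in this thread; the drop-in filename is arbitrary:

```sh
# Write a drop-in that sets LD_LIBRARY_PATH for the service.
# /usr/lib/ollama is the Arch package's library directory as described above.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/10-rocm-libs.conf >/dev/null <<'EOF'
[Service]
Environment=LD_LIBRARY_PATH=/usr/lib/ollama
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

This keeps the package files untouched, so it survives upgrades until the packaging itself is fixed.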


@svenstaro commented on GitHub (Nov 16, 2024):

This is clearly a downstream problem in Arch and I suggest we move over there and leave upstream in peace: https://gitlab.archlinux.org/archlinux/packaging/packages/ollama-rocm/-/issues/3


@zw963 commented on GitHub (Nov 16, 2024):

> This is clearly a downstream problem in Arch and I suggest we move over there and leave upstream in peace

In fact, I previously ran an ollama build downloaded directly from GitHub, and it worked well until I rolled to the latest Arch Linux several days ago, when ROCm was updated to 6.2.2. So I still suspect something in the ROCm-related package changes caused this.


@kode54 commented on GitHub (Nov 16, 2024):

Arch has fixed their packaging. It required a newer rocblas package, and a rebuild of the ollama-rocm package.


@Tealk commented on GitHub (Nov 16, 2024):

Yes, the fixes of the Arch packages worked for me.


@zw963 commented on GitHub (Nov 16, 2024):

Yes, it works for me on Arch after updating to ollama-rocm 0.4.2-2.


@Chris2000SP commented on GitHub (Nov 16, 2024):

Yeah, it works for me too. EDIT: After updating, it now works with the newest ROCm versions under Arch.


@svenstaro commented on GitHub (Nov 18, 2024):

Can this ticket be closed then?


@unclemusclez commented on GitHub (Nov 20, 2024):

> it seems each time you launch ollama it will create a different folder inside /tmp
>
> maybe it should copy both files? or add `/usr/lib/ollama` to library path
>
> EDIT: I just tried this command to add the library path
>
> ```sh
> LD_LIBRARY_PATH=/usr/lib/ollama/ ollama serve
> ```
>
> And it works without copying the .so file

@rainbyte I am still running into this issue on the main branch.
You are upgrading `libggml_rocm.so` when you compile `ollama`, and the env path is just pointing at that same location, correct?
In other words, do we need to make sure `libggml_rocm.so` is up to date?
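One way to check the staleness question above (a sketch under this thread's path assumptions; the function name is hypothetical) is to byte-compare the installed library with the copy the runner extracted under `/tmp`:

```sh
# Compare the installed libggml_rocm.so with the newest copy that
# `ollama serve` extracted under /tmp. Identical files mean the runner
# is loading the library you just built and installed.
check_rocm_lib() {
    installed="${1:-/usr/lib/ollama/libggml_rocm.so}"
    tmp_root="${2:-/tmp}"
    extracted="$(ls -t "$tmp_root"/ollama*/runners/rocm/libggml_rocm.so 2>/dev/null | head -n1)"
    if [ -z "$extracted" ]; then
        echo "no extracted copy found; start ollama serve first" >&2
        return 1
    fi
    if cmp -s "$installed" "$extracted"; then
        echo "up to date: $extracted matches $installed"
    else
        echo "stale: $extracted differs from $installed"
    fi
}
```

If the check reports "stale", re-copying the library (or fixing the library search path) should pick up the freshly built version.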

Reference: github-starred/ollama#30577