[GH-ISSUE #6857] Issues getting rocm support to compile on Gentoo #30087

Closed
opened 2026-04-22 09:32:49 -05:00 by GiteaMirror · 19 comments

Originally created by @kiaraly on GitHub (Sep 18, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6857

What is the issue?

I'm trying to get the project to compile on Gentoo but am running into some issues as Gentoo uses different paths.

On Gentoo, ROCm libraries get installed into /usr/lib64, hip-clang lives somewhere else, and I'm sure there are some other differences as well.

As suggested in the wiki, I set the following environment variables to point the build script at the right locations: ROCM_PATH=/usr/lib64 CLBlast_DIR=/usr/lib64/cmake/CLBlast. This got me a bit further, but compilation still failed because the compiler paths were wrong.
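For reference, the full build invocation looked roughly like this (a sketch; go generate is what drives the gen_linux.sh script mentioned below):

# environment variables prepended to the standard source build
ROCM_PATH=/usr/lib64 CLBlast_DIR=/usr/lib64/cmake/CLBlast go generate ./...
go build .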

I edited gen_linux.sh and changed the CMake definition for ROCm

CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DGGML_HIPBLAS=on 
-DGGML_CUDA_NO_PEER_COPY=on -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang 
-DCMAKE_CXX_COMPILER=$ROCM_PATH/llvm/bin/clang++ -DAMDGPU_TARGETS=$(amdGPUs) -DGPU_TARGETS=$(amdGPUs)"

to

CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DGGML_HIPBLAS=on 
-DGGML_CUDA_NO_PEER_COPY=on -DCMAKE_C_COMPILER=$(hipconfig -l)/clang 
-DCMAKE_CXX_COMPILER=$(hipconfig -l)/clang++ -DAMDGPU_TARGETS=$(amdGPUs) -DGPU_TARGETS=$(amdGPUs)"

(this seems to be how llama.cpp sets their HIPCXX path, per https://github.com/ggerganov/llama.cpp/blob/8962422b1c6f9b8b15f5aeaea42600bcc2d44177/docs/build.md#hipblas, and it points to the correct path for me). This got me one step further, but this time it complained about not finding some CMake files. Looking at the llama.cpp documentation again, it sets HIP_PATH for compilation as well (though wrong), so I modified the build function to export

export HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -p)"

before compilation.
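For anyone following along, this is what the two hipconfig queries resolve to (illustrative output only; the exact LLVM slot depends on the install):

hipconfig -l   # hip-clang binary directory, e.g. /usr/lib/llvm/18/bin
hipconfig -p   # HIP installation root, e.g. /usr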

After that, the project compiles correctly, but trying to load any model crashes ollama. The ollama serve process reports

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 32, N: 2, K: 256, alpha: 1, row_stride_a: 1, col_stride_a: 1024, row_stride_b: 1, col_stride_b: 2048, row_stride_c: 1, col_stride_c: 32, row_stride_d: 1, col_stride_d: 32, beta: 0, batch_count: 8, strided_batch: false, stride_a: 32768, stride_b: 4096, stride_c: 64, stride_d: 64, atomics_mode: atomics_allowed }
Alpha value -0.0281982 doesn't match that set in problem: 1
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
CUDA error: CUBLAS_STATUS_INTERNAL_ERROR
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1890
  hipblasGemmBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
/home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error
time=2024-09-18T14:50:34.883+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
time=2024-09-18T14:50:36.936+02:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR\n  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1890\n  hipblasGemmBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)\n/home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error"

the ollama run process crashes with

Error: llama runner process has terminated: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1890
  hipblasGemmBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
/home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error

I can't make any sense of these errors and don't know what else to try.

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

git head

GiteaMirror added the bug label 2026-04-22 09:32:49 -05:00

@dhiltgen commented on GitHub (Sep 21, 2024):

We're working on some changes that should make it a bit easier to adapt paths on other OSes going forward. #5034

I'm not sure what this failure is, but it may be rpath related, or possibly missing pieces ROCm is expecting in relative or absolute paths. You could try setting AMD_LOG_LEVEL=3, which will produce a lot of verbose logging from the various ROCm libraries and might help narrow it down.
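Something like this, for example (a sketch; redirecting stderr keeps the flood out of the terminal):

# the HIP runtime logs every call at this level
AMD_LOG_LEVEL=3 ./ollama serve 2> amd-debug.log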


@kiaraly commented on GitHub (Sep 21, 2024):

The log output didn't look interesting—just about 5000 lines of hipSuccess returns. On a second look, I saw that my ROCM_PATH was wrong; it should have been pointing to /usr instead. Fixing that still didn't make it work, so I went one step further and just copied the lib directory from the releases to my /usr/lib64, and it did work! I'm gonna spend some time figuring out which library causes the issue and then come back.
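Concretely, I did something like the following (a sketch; the tarball name and its layout are assumptions based on the release downloads and may differ per version):

# overwrite the system ROCm libraries with the ones bundled in the release
tar -xzf ollama-linux-amd64-rocm.tgz
cp -r lib/* /usr/lib64/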


@kiaraly commented on GitHub (Sep 23, 2024):

The issue seems to be with my system install of rocBLAS (librocblas). Could this be something as simple as an incompatible version?

I've had a quick look at the ebuild (https://github.com/gentoo/gentoo/blob/a1a9b484d807d3af24aacf2cd6318bc28b8187b5/sci-libs/rocBLAS/rocBLAS-6.1.1.ebuild), but neither the patches nor the configure options stood out to me. Could this still be an issue with ollama, or should I report it to the Gentoo package maintainer?
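For comparison, a quick way to line the two up (a sketch; the /tmp path is wherever ollama extracts its runners, as the logs later in this thread show):

ls -l /usr/lib64/librocblas.so*      # system rocBLAS from the ebuild
ls -l /tmp/ollama*/runners/rocm*/    # libraries bundled with the release build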


@waltercool commented on GitHub (Sep 25, 2024):

Did you find the error?

I've been getting the same issue, on Gentoo as well.

I'm sure this is something distro-based. I made a bug report a few days ago, after I found out ROCm was being compiled with LLVM 19 even though the ebuild says LLVM 18 for ROCm 6.1.


@kiaraly commented on GitHub (Sep 27, 2024):

I have semi-broken my PC and haven't been able to do any further testing. I looked at the Gentoo bugs for rocBLAS and only found https://bugs.gentoo.org/940231. I'll manually try updating the package(s) over the next couple of days and see if that changes anything, as the bug report suggests.


@rohitnanda1443 commented on GitHub (Oct 2, 2024):

I just compiled Ollama on Gentoo (after getting frustrated with vllm). I have a Ryzen 8700G / 780M with 64 GB RAM.

Steps:

  1. Followed the Gentoo ROCm guide: https://wiki.gentoo.org/wiki/ROCm
  2. git clone https://github.com/ollama/ollama.git
  3. cd ollama
  4. export AMDGPU_TARGETS="gfx1100;gfx1102"
  5. go generate ./...
  6. go build .
  7. ./ollama serve & (as the ollama executable is in the ollama directory)
  8. echo "export HSA_OVERRIDE_GFX_VERSION=11.0.0" >> .profile
  9. echo "export HSA_ENABLE_SDMA=0" >> .profile
  10. export HSA_OVERRIDE_GFX_VERSION=11.0.0
  11. export HSA_ENABLE_SDMA=0
  12. ./ollama run mistral:instruct

Output of ./ollama ps:

NAME              ID            SIZE    PROCESSOR  UNTIL
mistral:instruct  f974a74358d6  6.3 GB  100% GPU   4 minutes from now

Hope this helps.


@kiaraly commented on GitHub (Oct 2, 2024):

If you run ./ollama serve, there's a line in the log like this:

time=2024-10-02T13:53:08.292+02:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 rocm rocm_v60102]"

Is rocm included in the runners? When I compiled ollama the first time it wasn't, and ./ollama ps still reported that the model was loaded in VRAM, but the actual computation was done on the CPU without the changes mentioned in the original comment.
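A quick way to check (sketch):

# rocm entries in runners=[...] mean the ROCm runner was actually built and extracted
./ollama serve 2>&1 | grep -m1 "Dynamic LLM libraries"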

I finally got my new GPU working, and the error I'm getting has changed. Instead of the old one I now get the following, and using the bundled rocBLAS.so no longer fixes the issue.

ggml_cuda_compute_forward: SCALE failed
CUDA error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at /home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:2326
  err
/home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error

When I compiled llama.cpp manually, the resulting binary had working ROCm support. I'm gonna look at how ollama compiles it and see if I can make any progress from there.


@rohitnanda1443 commented on GitHub (Oct 2, 2024):

Output of my ./ollama serve

2024/10/02 18:34:53 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION:11.0.0 HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-10-02T18:34:53.904+05:30 level=INFO source=images.go:753 msg="total blobs: 11"
time=2024-10-02T18:34:53.904+05:30 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

  • using env: export GIN_MODE=release
  • using code: gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-10-02T18:34:53.905+05:30 level=INFO source=routes.go:1205 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-10-02T18:34:53.905+05:30 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1610614431/runners
time=2024-10-02T18:34:53.919+05:30 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2]"
time=2024-10-02T18:34:53.919+05:30 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-10-02T18:34:53.922+05:30 level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-10-02T18:34:53.923+05:30 level=INFO source=amd_linux.go:349 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION=11.0.0
time=2024-10-02T18:34:53.923+05:30 level=INFO source=types.go:107 msg="inference compute" id=0 library=rocm variant="" compute=gfx1103 driver=0.0 name=1002:15bf total="16.0 GiB" available="15.4 GiB"


@kiaraly commented on GitHub (Oct 2, 2024):

> time=2024-10-02T18:34:53.919+05:30 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2]"

I think it's not using your GPU. I noticed this when I loaded a bigger model (maybe try llama3.1?) and saw a massive speed difference between my compiled version and the binary from the releases. Maybe you could try the same?
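The timing summary makes the difference easy to spot (a sketch; --verbose prints eval rates after each reply):

./ollama run llama3.1 --verbose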


@rohitnanda1443 commented on GitHub (Oct 2, 2024):

Yes, you are correct; I tried llama-3.1. Similar issues have been reported by others on Nvidia as well: https://github.com/ollama/ollama/issues/4486 ("Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support")

Interestingly, I am not getting this issue while running Mistral-7B-v0.3:

NAME              ID            SIZE    PROCESSOR  UNTIL
mistral:instruct  f974a74358d6  6.3 GB  100% GPU   4 minutes from now

[GIN] 2024/10/02 - 19:10:51 | 200 | 12.294952ms | 127.0.0.1 | POST "/api/show"
time=2024-10-02T19:10:51.823+05:30 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=0 parallel=4 available=16494211072 required="6.2 GiB"
time=2024-10-02T19:10:51.823+05:30 level=INFO source=server.go:103 msg="system memory" total="46.6 GiB" free="42.5 GiB" free_swap="63.6 GiB"
time=2024-10-02T19:10:51.823+05:30 level=INFO source=memory.go:326 msg="offload to rocm" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-10-02T19:10:51.825+05:30 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1610614431/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 35323"
time=2024-10-02T19:10:51.825+05:30 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-10-02T19:10:51.825+05:30 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
time=2024-10-02T19:10:51.825+05:30 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="140523311131776" timestamp=1727876451
INFO [main] build info | build=3670 commit="bf6c2c83" tid="140523311131776" timestamp=1727876451
INFO [main] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140523311131776" timestamp=1727876451 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="35323" tid="140523311131776" timestamp=1727876451
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = llama
llama_model_loader: - kv   1: general.type str = model
llama_model_loader: - kv   2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3: general.finetune str = Instruct
llama_model_loader: - kv   4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv   5: general.size_label str = 8B
llama_model_loader: - kv   6: general.license str = llama3.1
llama_model_loader: - kv   7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9: llama.block_count u32 = 32
llama_model_loader: - kv  10: llama.context_length u32 = 131072
llama_model_loader: - kv  11: llama.embedding_length u32 = 4096
llama_model_loader: - kv  12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv  13: llama.attention.head_count u32 = 32
llama_model_loader: - kv  14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv  15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv  16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv  17: general.file_type u32 = 2
llama_model_loader: - kv  18: llama.vocab_size u32 = 128256
llama_model_loader: - kv  19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv  20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv  21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv  22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv  26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv  27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28: general.quantization_version u32 = 2
llama_model_loader: - type  f32: 66 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 4437.80 MiB
time=2024-10-02T19:10:52.268+05:30 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU output buffer size = 2.02 MiB
llama_new_context_with_model: CPU compute buffer size = 560.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
INFO [main] model loaded | tid="140523311131776" timestamp=1727876452
time=2024-10-02T19:10:52.770+05:30 level=INFO source=server.go:626 msg="llama runner started in 0.95 seconds"


@kiaraly commented on GitHub (Oct 2, 2024):

If you want to try it as well, you can apply this patch and compile ollama with ROCM_PATH=/usr CLBlast_DIR=/usr/lib64/cmake/CLBlast AMDGPU_TARGETS="gfx1100" go generate './...' (replace the GPU target with your card's version).

diff --git a/llm/generate/gen_common.sh b/llm/generate/gen_common.sh
index 3825c155..513ac9d2 100644
--- a/llm/generate/gen_common.sh
+++ b/llm/generate/gen_common.sh
@@ -76,6 +76,7 @@ apply_patches() {
 }
 
 build() {
+	export HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -p)"
     cmake -S ${LLAMACPP_DIR} -B ${BUILD_DIR} ${CMAKE_DEFS}
     cmake --build ${BUILD_DIR} ${CMAKE_TARGETS} -j8
     # remove unnecessary build artifacts
diff --git a/llm/generate/gen_linux.sh b/llm/generate/gen_linux.sh
index 48d08fd0..0eebeab4 100755
--- a/llm/generate/gen_linux.sh
+++ b/llm/generate/gen_linux.sh
@@ -260,11 +260,11 @@ fi
 
 if [ -z "${OLLAMA_SKIP_ROCM_GENERATE}" -a -d "${ROCM_PATH}" ]; then
     echo "ROCm libraries detected - building dynamic ROCm library"
-    if [ -f ${ROCM_PATH}/lib/librocblas.so.*.*.????? ]; then
-        ROCM_VARIANT=_v$(ls ${ROCM_PATH}/lib/librocblas.so.*.*.????? | cut -f5 -d. || true)
+    if [ -f ${ROCM_PATH}/lib64/librocblas.so.*.*.????? ]; then
+        ROCM_VARIANT=_v$(ls ${ROCM_PATH}/lib64/librocblas.so.*.*.????? | cut -f5 -d. || true)
     fi
     init_vars
-    CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DGGML_HIPBLAS=on -DGGML_CUDA_NO_PEER_COPY=on -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang -DCMAKE_CXX_COMPILER=$ROCM_PATH/llvm/bin/clang++ -DAMDGPU_TARGETS=$(amdGPUs) -DGPU_TARGETS=$(amdGPUs)"
+	CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DGGML_HIPBLAS=on -DGGML_CUDA_NO_PEER_COPY=on -DCMAKE_C_COMPILER=$(hipconfig -l)/clang -DCMAKE_CXX_COMPILER=$(hipconfig -l)/clang++ -DAMDGPU_TARGETS=$(amdGPUs) -DGPU_TARGETS=$(amdGPUs)"
     # Users building from source can tune the exact flags we pass to cmake for configuring llama.cpp
     if [ -n "${OLLAMA_CUSTOM_ROCM_DEFS}" ]; then
         echo "OLLAMA_CUSTOM_ROCM_DEFS=\"${OLLAMA_CUSTOM_ROCM_DEFS}\""
@@ -277,7 +277,7 @@ if [ -z "${OLLAMA_SKIP_ROCM_GENERATE}" -a -d "${ROCM_PATH}" ]; then
     ROCM_DIST_DIR="${DIST_BASE}/../linux-${GOARCH}-rocm/lib/ollama"
     # TODO figure out how to disable runpath (rpath)
     # export CMAKE_HIP_FLAGS="-fno-rtlib-add-rpath" # doesn't work
-    export LLAMA_SERVER_LDFLAGS="-L${ROCM_PATH}/lib -L/opt/amdgpu/lib/x86_64-linux-gnu/ -lhipblas -lrocblas -lamdhip64 -lrocsolver -lamd_comgr -lhsa-runtime64 -lrocsparse -ldrm -ldrm_amdgpu"
+    export LLAMA_SERVER_LDFLAGS="-L${ROCM_PATH}/lib -L${ROCM_PATH}/lib64 -L/opt/amdgpu/lib/x86_64-linux-gnu/ -lhipblas -lrocblas -lamdhip64 -lrocsolver -lamd_comgr -lhsa-runtime64 -lrocsparse -ldrm -ldrm_amdgpu"
     build
 
     # copy the ROCM dependencies

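To apply and build (a sketch; gentoo-rocm.patch is just whatever filename you save the diff above under):

git apply gentoo-rocm.patch
ROCM_PATH=/usr CLBlast_DIR=/usr/lib64/cmake/CLBlast AMDGPU_TARGETS="gfx1100" go generate ./...
go build .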
I thought I had made some progress on the error, but that seems to have been wrong. Maybe I'll just have to wait until the changes mentioned in https://github.com/ollama/ollama/issues/6857#issuecomment-2364782947 have landed and try again afterwards.


@waltercool commented on GitHub (Oct 2, 2024):

Using hipconfig is the correct way


@ProjectMoon commented on GitHub (Oct 3, 2024):

It seems the GURU ebuild on Gentoo doesn't properly compile in GPU support, even when the nvidia or amd USE flags are enabled. This is my experience using the ebuild (which I just tested briefly). I was able to run it fine; it just offloaded to CPU instead of GPU. Interestingly, the compiled version from the ebuild finds the ROCm device on startup and considers it an inference resource, but when running the model, the llama.cpp subprocess says it wasn't compiled with GPU support.


@kiaraly commented on GitHub (Oct 17, 2024):

I don't know if it's progress, but I'm getting a different error now. Given that everything seems to work with a different librocblas.so, I decided to hack a bit on the ebuild. I commented out this part of the ebuild (https://github.com/gentoo/gentoo/blob/f9c02033d5657758c53ff90e982b63cf0578b9fa/sci-libs/rocBLAS/rocBLAS-6.1.1.ebuild#L55C1-L58C2) and now get the following error:

ggml_cuda_compute_forward: SCALE failed
CUDA error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at /home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:2326
  err
/home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error

Compiling llama.cpp manually doesn't produce this error; instead it just produces gibberish output.


@dhiltgen commented on GitHub (Oct 24, 2024):

It may need some adjusting, but please give the new Go server build a try. It no longer relies on cmake.

https://github.com/ollama/ollama/blob/main/docs/development.md#transition-to-go-runner


@kiaraly commented on GitHub (Oct 24, 2024):

With the new Go server, building the project was much smoother than before. It compiled correctly and tries to use my GPU but sadly fails. When trying to run gemma2:27b or gemma2:2b, I get the error I mentioned in my previous comment. I also tried running llama2-uncensored and got a somewhat different error.

ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at /home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:2326
  err
/home/roger/Git/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error

I searched online for RMS_NORM failed but sadly didn't find anything helpful.

Do you think this is even an ollama error? Or should I open an issue in llama.cpp about this?


@stalkerg commented on GitHub (Oct 29, 2024):

> Using hipconfig is the correct way

Do we have CMake integration with hipconfig?
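For context, as used earlier in this thread, the hipconfig output is fed into CMake from the shell side rather than CMake querying it itself; a sketch built from the flags above:

cmake -S llm/llama.cpp -B build \
  -DGGML_HIPBLAS=on \
  -DCMAKE_C_COMPILER="$(hipconfig -l)/clang" \
  -DCMAKE_CXX_COMPILER="$(hipconfig -l)/clang++"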


@lubosz commented on GitHub (Nov 10, 2024):

@Roger-Roger-debug If you see the CUBLAS_STATUS_INTERNAL_ERROR in hipblasGemmBatchedEx on llama.cpp as well, you can hop onto my issue:
https://github.com/ggerganov/llama.cpp/issues/10234

I have the same error with llama.cpp and ollama on Arch Linux.


@kiaraly commented on GitHub (Nov 30, 2024):

I've reported the underlying rocBLAS issue to the Gentoo bug tracker (https://bugs.gentoo.org/944820).
With the patch, everything works on #7499, so as far as Ollama is concerned this issue can be closed once the PR gets merged.

Reference: github-starred/ollama#30087