[GH-ISSUE #1756] Older CUDA compute capability 3.5 and 3.7 support #26768

Closed
opened 2026-04-22 03:20:13 -05:00 by GiteaMirror · 84 comments

Originally created by @orlyandico on GitHub (Jan 1, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1756

Originally assigned to: @dhiltgen on GitHub.

I recently put together an (old) physical machine with an Nvidia K80, which is only supported up to CUDA 11.4 and Nvidia driver 470. All my previous experiments with Ollama were with more modern GPUs.

I found that Ollama doesn't use the GPU at all. I cannot find any documentation on the minimum required CUDA version, or on whether it is possible to run on older CUDA versions (e.g. the Nvidia K80 and V100 are still present in the cloud, e.g. G2 and P2 on AWS), and there are lots of K80s all over eBay.

EDIT: looking through the logs, it appears that the GPUs are being seen:

```
Jan 1 20:22:43 thinkstation-s30 ollama[911]: 2024/01/01 20:22:43 llama.go:300: 24762 MB VRAM available, loading up to 162 GPU layers
Jan 1 20:22:43 thinkstation-s30 ollama[911]: 2024/01/01 20:22:43 llama.go:436: starting llama runner
Jan 1 20:22:43 thinkstation-s30 ollama[911]: 2024/01/01 20:22:43 llama.go:494: waiting for llama runner to start responding
Jan 1 20:22:43 thinkstation-s30 ollama[911]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
Jan 1 20:22:43 thinkstation-s30 ollama[911]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
Jan 1 20:22:43 thinkstation-s30 ollama[911]: ggml_init_cublas: found 3 CUDA devices:
Jan 1 20:22:43 thinkstation-s30 ollama[911]: Device 0: Tesla K80, compute capability 3.7
Jan 1 20:22:43 thinkstation-s30 ollama[911]: Device 1: Tesla K80, compute capability 3.7
Jan 1 20:22:43 thinkstation-s30 ollama[911]: Device 2: NVIDIA GeForce GT 730, compute capability 3.5
```

and

```
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: ggml ctx size = 0.11 MiB
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: using CUDA for GPU acceleration
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: mem required = 70.46 MiB
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: offloading 32 repeating layers to GPU
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: offloaded 33/33 layers to GPU
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: VRAM used: 3577.61 MiB
```

but....

```
Jan 1 20:34:21 thinkstation-s30 ollama[911]: CUDA error 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: no kernel image is available for execution on the device
Jan 1 20:34:21 thinkstation-s30 ollama[911]: current device: 0
Jan 1 20:34:21 thinkstation-s30 ollama[911]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: !"CUDA error"
Jan 1 20:34:22 thinkstation-s30 ollama[911]: 2024/01/01 20:34:22 llama.go:451: 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: no kernel image is available for execution on the device
Jan 1 20:34:22 thinkstation-s30 ollama[911]: current device: 0
Jan 1 20:34:22 thinkstation-s30 ollama[911]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: !"CUDA error"
Jan 1 20:34:22 thinkstation-s30 ollama[911]: 2024/01/01 20:34:22 llama.go:459: error starting llama runner: llama runner process has terminated
```

GiteaMirror added the build, nvidia, feature request labels 2026-04-22 03:20:14 -05:00

@Cybervet commented on GitHub (Jan 2, 2024):

What is your Linux kernel? I think 6+ kernels don't support a lot of older Nvidia cards.

@orlyandico commented on GitHub (Jan 2, 2024):

The kernel is 6+ and the setup is supported. I was able to get PyTorch working with CUDA, albeit only PyTorch 2.0.1, since that is the last version that supports CUDA 11.4.

Error 209, "no kernel image is available for execution on the device", refers to a CUDA kernel, not the Linux kernel. Basically, the Ollama distribution doesn't ship a kernel compiled (via nvcc) for this architecture on CUDA 11.4 (I'm not even sure that's supported if I build from source).
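For anyone double-checking their own card, a quick sketch for confirming what the driver reports (the `compute_cap` query field only exists in newer nvidia-smi builds; on an older 470-era driver, look the card up at https://developer.nvidia.com/cuda-gpus instead):

```sh
# List each GPU with the compute capability the driver reports.
nvidia-smi --query-gpu=index,name,compute_cap --format=csv
# The header of plain `nvidia-smi` shows the maximum CUDA version the driver supports.
nvidia-smi | head -n 4
```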

@yolobnb commented on GitHub (Jan 8, 2024):

Same here. I am using a Quadro K2200. It is recognized, along with its compute capability, but as soon as I pull a model the error shows up and Ollama terminates.

@dhiltgen commented on GitHub (Jan 9, 2024):

The K80 is Compute Capability 3.7, which at present isn't supported by our CUDA builds. (see https://developer.nvidia.com/cuda-gpus for the mapping table)

Based on our current build setup, Compute Capability 6.0 is the minimum we'll support. We had some bugs in the detection and fallback logic in 0.1.18, which should be resolved in 0.1.19, so that if we detect anything older than 6.0 we'll fall back to CPU.

There's a possibility we may be able to support 5.x cards by compiling llama.cpp with different flags and dynamically loading the right library variant on the fly based on what we discover, but that support hasn't been merged yet.

I'm not sure yet if we can compile support going all the way back into the 3.7 series, but we'll keep this ticket tracking that.

@datag commented on GitHub (Jan 10, 2024):

I'd love to see that change. Owner of an old GeForce GTX 960M on amd64 Linux here. Version 0.1.18 stopped working, while 0.1.17 had been working.

@dhiltgen commented on GitHub (Jan 10, 2024):

> I'd love to see that change. Owner of an old GeForce GTX 960M on amd64 Linux here. Version 0.1.18 stopped working, while 0.1.17 had been working.

Can you clarify? Was 0.1.17 working on the GPU, or falling back to CPU mode?

Also to clarify, the [GTX 960M is a Compute Capability 5.0](https://developer.nvidia.com/cuda-gpus) card, which we're tracking in a different ticket now #1865

@datag commented on GitHub (Jan 10, 2024):

> > I'd love to see that change. Owner of an old GeForce GTX 960M on amd64 Linux here. Version 0.1.18 stopped working, while 0.1.17 had been working.
>
> Can you clarify? Was 0.1.17 working on the GPU, or falling back to CPU mode?
>
> Also to clarify, the [GTX 960M is a Compute Capability 5.0](https://developer.nvidia.com/cuda-gpus) card, which we're tracking in a different ticket now #1865

You're right, I guess it was falling back to CPU mode, but I'm unsure how to read the logs correctly.

The issue you mentioned seems to be the issue I was having. Version 0.1.19 fixes it. Sorry for the noise and thanks!

@dhiltgen commented on GitHub (Jan 10, 2024):

> but I'm unsure how to read the logs correctly.

At startup the server log will report information about attempting to discover GPU information, and in the case of CUDA cards, will report the compute capability. If we don't detect a supported GPU, we report that we're falling back to CPU mode. In the near future we'll be adding refinements to support multiple variants for a given GPU (and CPU) to try to leverage modern capabilities when detected, but also be able to fallback to a baseline that works for older GPUs/CPUs.
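For readers on a systemd-based Linux install (where the installer sets up an ollama.service unit, as a later comment in this thread confirms), a hedged sketch for pulling those discovery lines out of the server log:

```sh
# Filter the Ollama server log for GPU discovery and compute capability messages.
journalctl -u ollama --no-pager | grep -iE 'gpu|cuda|compute capability'
```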

@nejib1 commented on GitHub (Jan 15, 2024):

Hello, same case here: I have an Nvidia K80, and Ollama works only on the CPU :(

@sunzh231 commented on GitHub (Jan 16, 2024):

Hi, same case here: I have an Nvidia M40, and Ollama works only on the CPU in a Docker container :(

@dhiltgen commented on GitHub (Jan 18, 2024):

> Hi, same case here: I have an Nvidia M40, and Ollama works only on the CPU in a Docker container :(

The [M40](https://developer.nvidia.com/cuda-gpus) is a Compute Capability 5.2 card, so it's covered by #1865

@dhiltgen commented on GitHub (Jan 20, 2024):

We're using CUDA v11 to compile our official builds. Digging around a bit, it looks like CUDA v11 no longer supports Compute Capability 3.0, but I am able to get nvcc to target 3.5 cards.

I'll work on some mods to the way we do our builds so that someone with a 3.0 card and an older CUDA toolkit might be able to build it on their own from source, but I think we may be able to get 3.5+ support into the official builds.

@orlyandico commented on GitHub (Jan 20, 2024):

The K80 I referenced in my original post supports up to CUDA 11.4, which is the last version it will ever support, since it has been end-of-lifed.

@dhiltgen commented on GitHub (Jan 20, 2024):

PR #2116 lays the foundation for experimenting with CC 3.5 support. I'm not sure if we'll need other flags to get it working, or simply to add "35" to the list of `CMAKE_CUDA_ARCHITECTURES`.
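For illustration only, and not necessarily what the PR ships: with llama.cpp's CMake build of that era, targeting extra architectures looked roughly like this (the `LLAMA_CUBLAS` flag and the exact architecture list here are assumptions):

```sh
# Sketch: compile the CUDA backend for additional compute capabilities;
# "35;37" covers the GT 730 / K80 class cards discussed in this issue.
cmake -B build -DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES="35;37;50;52;60;61;70"
cmake --build build --config Release
```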

@orlyandico commented on GitHub (Jan 25, 2024):

EDIT: I am aware that there are "resizable BAR" issues around the use of the Tesla P40, and my hardware is so ancient that it does not support resizable BAR. However, PyTorch runs just fine, and I can load e.g. BigBird into the P40 and do inference. Note that my PyTorch install is 2.0.1 and also worked on the K80. PyTorch itself warns that the GT 730 (CC 3.5) is not supported; CC 3.7 is the lowest supported on 2.0.1 (which is a few years old at this point).


I replaced the K80 with a P40, which is a Compute Capability 6.1 card. The card appears in nvidia-smi and is detected in the Ollama logs:

```
...
Jan 25 15:26:21 thinkstation-s30 ollama[919]: ggml_init_cublas: found 2 CUDA devices:
Jan 25 15:26:21 thinkstation-s30 ollama[919]: Device 0: Tesla P40, compute capability 6.1
Jan 25 15:26:21 thinkstation-s30 ollama[919]: Device 1: NVIDIA GeForce GT 730, compute capability 3.5
...
```

However, I still get the "no kernel image" error; it appears to be using Device 1! It's not very clear how to force the use of Device 0 (when I was using the K80, it was being selected properly). I tried the CUDA_VISIBLE_DEVICES environment variable, which had no effect.

```
...
Jan 25 15:26:26 thinkstation-s30 ollama[919]: llama_new_context_with_model: total VRAM used: 2258.20 MiB (model: 1456.19 MiB, context: 802.00 MiB)
Jan 25 15:26:26 thinkstation-s30 ollama[919]: CUDA error 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: no kernel image is available for execution on the device
Jan 25 15:26:26 thinkstation-s30 ollama[919]: current device: 1
Jan 25 15:26:26 thinkstation-s30 ollama[919]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: !"CUDA error"
Jan 25 15:26:27 thinkstation-s30 ollama[919]: 2024/01/25 15:26:27 llama.go:451: 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: no kernel image is available for execution on the device
Jan 25 15:26:27 thinkstation-s30 ollama[919]: current device: 1
Jan 25 15:26:27 thinkstation-s30 ollama[919]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: !"CUDA error"
...
```
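For reference, pinning CUDA to a single device is normally done by setting CUDA_VISIBLE_DEVICES in the environment of the server process itself; a minimal sketch (as a later comment shows, the catch is that a systemd-managed service does not inherit variables from a login shell):

```sh
# Expose only device 0 (the P40) to CUDA, then start the server by hand.
CUDA_VISIBLE_DEVICES=0 ollama serve
```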

@dhiltgen commented on GitHub (Jan 25, 2024):

@orlyandico it's unfortunate that CUDA_VISIBLE_DEVICES didn't do the trick. I'll see if I can set up a test rig similar to your setup and try to find a way to ignore the unsupported card.

@orlyandico commented on GitHub (Jan 25, 2024):

I've also gotten DiffusionPipeline and models from HuggingFace working. It is a bit odd that `torch.cuda.device_count()` sometimes returns 1 (and only enumerates the P40) and sometimes 2 (also enumerating the GT 730).

@dhiltgen commented on GitHub (Jan 27, 2024):

I've got a PR up to add support, but I'm a little concerned people might actually see a performance hit, not an improvement, by transitioning from CPU to GPU on these old cards.

Folks with these old cards: if you want to give the change a try, build from source, and let me know how the performance compares before and after, that would be helpful in weighing when/if we merge the PR.
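One hedged way to collect comparable numbers is through the standard Ollama REST API; durations in the response are in nanoseconds, matching the JSON shown later in this thread (the model name and prompt here are arbitrary):

```sh
# Sketch: run one non-streaming generation and extract the token counts/timings.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | grep -o '"eval_count":[0-9]*\|"eval_duration":[0-9]*'
# tokens per second = eval_count / (eval_duration / 1e9)
```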

@felipecock commented on GitHub (Jan 28, 2024):

> Folks with these old cards: if you want to give the change a try, build from source, and let me know how the performance compares before and after, that would be helpful in weighing when/if we merge the PR.

Hi @dhiltgen

I have a GeForce 920M GPU, which has CC 3.5. I'd like to participate in that test; please guide me on how to compile it on Ubuntu 22.04 and how to benchmark with and without the GPU.

I appreciate your contributions and your efforts to support these older GPUs.

@dhiltgen commented on GitHub (Jan 28, 2024):

Thanks @felipecock

Check out https://github.com/ollama/ollama/blob/main/docs/development.md for instructions, and if you get stuck, join the community on [Discord](https://discord.gg/ollama) for an added hand.
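At the time, the documented Linux build boiled down to roughly the following (a sketch based on docs/development.md; cmake, gcc, Go, and the CUDA toolkit are assumed prerequisites):

```sh
git clone https://github.com/ollama/ollama.git
cd ollama
go generate ./...   # builds the bundled llama.cpp runners
go build .
```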

@nejib1 commented on GitHub (Jan 28, 2024):

Hello @dhiltgen
Is there any possibility of getting Ollama to work with the Nvidia K80 in the next few days, or should we abandon this idea?

@dhiltgen commented on GitHub (Jan 29, 2024):

@nejib1 if you apply the changes of my PR as a [patch](https://patch-diff.githubusercontent.com/raw/ollama/ollama/pull/2233.patch) to the repo and build from source, it will run on a K80 GPU. Instructions on building from source are [here](https://github.com/ollama/ollama/blob/main/docs/development.md).

Given our concern that this might actually result in a performance regression rather than an improvement for users, we're going to hold off merging this until we get more performance data.
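A sketch of applying the PR as a patch before building (the patch URL is the one linked above; the build commands assume the docs/development.md flow):

```sh
cd ollama
curl -L https://patch-diff.githubusercontent.com/raw/ollama/ollama/pull/2233.patch | git apply
go generate ./... && go build .
```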

@nejib1 commented on GitHub (Jan 29, 2024):

> @nejib1 if you apply the changes of my PR as a [patch](https://patch-diff.githubusercontent.com/raw/ollama/ollama/pull/2233.patch) to the repo and build from source, it will run on a K80 GPU. Instructions on building from source are [here](https://github.com/ollama/ollama/blob/main/docs/development.md).
>
> Given our concern that this might actually result in a performance regression rather than an improvement for users, we're going to hold off merging this until we get more performance data.

Thank you very much, I'll try it.

@tbendien commented on GitHub (Jan 29, 2024):

I am having similar issues trying to run Ollama Web UI with my RTX A4000 16 GB GPU. When I run standard Ollama, it uses my GPU just fine. When I install Ollama Web UI, I get errors (from a fully clean Ubuntu install, with all NVIDIA drivers and the container toolkit installed).

**Ollama Web UI** commands

```
gtadmin@gtaiws3:~/ollama-webui$ docker-compose -f docker-compose.yaml -f docker-compose.gpu.yaml up
Traceback (most recent call last):
  File "/usr/bin/docker-compose", line 33, in <module>
    sys.exit(load_entry_point('docker-compose==1.29.2', 'console_scripts', 'docker-compose')())
  File "/usr/lib/python3/dist-packages/compose/cli/main.py", line 81, in main
    command_func()
  File "/usr/lib/python3/dist-packages/compose/cli/main.py", line 200, in perform_command
    project = project_from_options('.', options)
  File "/usr/lib/python3/dist-packages/compose/cli/command.py", line 60, in project_from_options
    return get_project(
  File "/usr/lib/python3/dist-packages/compose/cli/command.py", line 157, in get_project
    return Project.from_config(
  File "/usr/lib/python3/dist-packages/compose/project.py", line 135, in from_config
    service_dict['device_requests'] = project.get_device_requests(service_dict)
  File "/usr/lib/python3/dist-packages/compose/project.py", line 375, in get_device_requests
    raise ConfigurationError(
TypeError: ConfigurationError.__init__() takes 2 positional arguments but 3 were given
```

**When I just run the CPU-only yaml, everything works fine...**
```
gtadmin@gtaiws3:~/ollama-webui$ docker-compose -f docker-compose.yaml up
ollama is up-to-date
ollama-webui is up-to-date
Attaching to ollama, ollama-webui
ollama | Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
ollama | Your new public key is:
ollama |
ollama | ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEi4k2WvzJB4+o3PMQTvhq1M2ci6JnEYfDUiH6Dl6k+k
ollama |
ollama | 2024/01/29 02:09:40 images.go:857: INFO total blobs: 0
ollama | 2024/01/29 02:09:40 images.go:864: INFO total unused blobs removed: 0
ollama | 2024/01/29 02:09:40 routes.go:950: INFO Listening on [::]:11434 (version 0.1.22)
ollama | 2024/01/29 02:09:40 payload_common.go:106: INFO Extracting dynamic libraries...
ollama | 2024/01/29 02:09:42 payload_common.go:145: INFO Dynamic LLM libraries [cpu cuda_v11 cpu_avx rocm_v5 rocm_v6 cpu_avx2]
ollama | 2024/01/29 02:09:42 gpu.go:94: INFO Detecting GPU type
ollama | 2024/01/29 02:09:42 gpu.go:236: INFO Searching for GPU management library libnvidia-ml.so
ollama | 2024/01/29 02:09:42 gpu.go:282: INFO Discovered GPU libraries: []
ollama | 2024/01/29 02:09:42 gpu.go:236: INFO Searching for GPU management library librocm_smi64.so
ollama | 2024/01/29 02:09:42 gpu.go:282: INFO Discovered GPU libraries: []
ollama | 2024/01/29 02:09:42 cpu_common.go:11: INFO CPU has AVX2
ollama | 2024/01/29 02:09:42 routes.go:973: INFO no GPU detected
ollama-webui | start.sh: 3: Bad substitution
ollama-webui | INFO: Started server process [1]
ollama-webui | INFO: Waiting for application startup.
ollama-webui | INFO: Application startup complete.
ollama-webui | INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```

@dhiltgen commented on GitHub (Jan 29, 2024):

@tbendien an RTX A4000 is a modern GPU with [Compute Capability 8.6](https://developer.nvidia.com/cuda-gpus). Let's keep this ticket focused on support for much older cards with CC 3.5 and 3.7. Folks can help troubleshoot on [Discord](https://discord.gg/ollama), or you can open a new issue.

@felipecock commented on GitHub (Jan 31, 2024):

@dhiltgen
I've perfomed a test with and without the GPU:

Only CPU Intel Core i7 5500U CPU - ollama:main branch (time seems to be in ns)

{"model":"llama2:latest","created_at":"2024-01-31T22:24:33.848173925Z","message":{"role":"assistant","content":""},"done":true,"total_duration":330940056957,"load_duration":3067744651,"prompt_eval_count":457,"prompt_eval_duration":227370727000,"eval_count":157,"eval_duration":100501014000}

GPU GeForce 920M @ 4GB (It only reached up to 33% GPU about the first minute, Dedicated Memory doesn't seem to be used) + CPU Intel Core i7 5500U CPU (Reached 100% most of time) - ollama:cc_3.5 branch

llama_print_timings: load time = 2001.26 ms
llama_print_timings: sample time = 168.67 ms / 175 runs ( 0.96 ms per token, 1037.54 tokens per second)
llama_print_timings: prompt eval time = 110295.28 ms / 154 tokens ( 716.20 ms per token, 1.40 tokens per second)
llama_print_timings: eval time = 198530.10 ms / 174 runs ( 1140.98 ms per token, 0.88 tokens per second)
llama_print_timings: total time = 309092.12 ms

It was a bit faster with GPU, although it was not used at 100% as I expected, IDK if that is ok for this model.

@orlyandico commented on GitHub (Feb 1, 2024):

> @orlyandico it's unfortunate that CUDA_VISIBLE_DEVICES didn't do the trick. I'll see if I can set up a test rig similar to your setup and try to find a way to ignore the unsupported card.

Found the reason: ollama.service was launched from systemd, and so wasn't picking up CUDA_VISIBLE_DEVICES from my environment.

That still leaves the question of why the CC 3.5 device was being selected when it isn't the first device and is not supported. Ollama probably should have logic to select only the supported CUDA devices on a multi-device host.
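For others who hit this: the standard systemd way to hand an environment variable to the service is a drop-in override (a sketch, assuming the installer's ollama.service unit):

```sh
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="CUDA_VISIBLE_DEVICES=0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```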

@dhiltgen commented on GitHub (Feb 1, 2024):

@orlyandico we don't yet have logic to automatically detect and bypass unsupported cards in a multi-GPU setup where one isn't supported but others are.

@felipecock can you clarify your scenario? Are you attempting to load a model that can't fit entirely in VRAM and thus are getting a split between CPU/GPU? For apples-to-apples performance comparison, I'd try to get metrics from a model that fits entirely in the GPU so we're not getting thrown off by I/O bottlenecks or GPU stalling waiting for CPU.

@felipecock commented on GitHub (Feb 2, 2024):

@dhiltgen, I've performed a test on a newer machine (13th Gen Intel(R) Core(TM) i9-13900H, 2600 MHz, 14 cores, 20 logical processors, 64 GB RAM + NVIDIA RTX 2000 Ada Generation Laptop GPU) and realized that the CPU is used much more extensively than the GPU, despite Ollama saying the GPU was to be used:

```
gpu.go:88: Detecting GPU type
gpu.go:203: Searching for GPU management library libnvidia-ml.so
gpu.go:248: Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1]
gpu.go:94: Nvidia GPU detected
gpu.go:135: CUDA Compute Capability detected: 8.9
...
shim_ext_server_linux.go:24: Updating PATH to /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/ollama1622992116/cuda
shim_ext_server.go:92: Loading Dynamic Shim llm server: /tmp/ollama1622992116/cuda/libext_server.so
ext_server_common.go:136: Initializing internal llama server
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation Laptop GPU, compute capability 8.9
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256:3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
```

So I think the expected behavior is that some part of the process that cannot be parallelized (I believe) runs on the CPU, and that results in heavier CPU usage than GPU usage.

I'm not an expert in this, so I could be wrong. 😕

@dhiltgen commented on GitHub (Feb 2, 2024):

@felipecock I'm not quite sure what your question is. It looks like that GPU has 12G of VRAM, so you'll be able to run larger models entirely on the GPU than a typical CC 3.5 or 3.7 card. We're drifting a bit off-topic for this issue, but if the model doesn't fit in VRAM, then some amount of processing is done on the CPU, and often this can result in poor performance as the GPU stalls waiting for the CPU to keep up.

The current state of this issue is I have a PR up which would enable support for these older cards, but we're not sure if we're going to merge it yet or not, as we're concerned it could be a performance hit for many users given these older cards aren't particularly well suited for LLM work.

@orlyandico commented on GitHub (Feb 2, 2024):

I gave up on the K80 and got a P40 because, even though the K80 is 2 x 12 GB, it doesn't support smaller data types! You're stuck with fp32 and fp64, so even a 7B model won't fit in the 12 GB of RAM!

@felipecock commented on GitHub (Feb 3, 2024):

Thank you for your reply; that super laptop is not mine. My laptop is the Intel 5500 with the GeForce 920M, which has CC 3.5.

I just did the test on that "super" laptop to validate whether, even with a supported CUDA GPU, some high-demand processing is performed on the CPU, and it is.

I've published my results, but if you want to guide me toward a better test, or point out some specific scenarios to test, I'll be happy to do so.

Thank you again for your time and gentleness!

@orlyandico commented on GitHub (Feb 3, 2024):

@felipecock if the model doesn't fit entirely in the GPU RAM, then only some of the layers are stored and evaluated on the GPU, the rest on the CPU. So even if some layers evaluate faster on the GPU, the inference stalls waiting for the CPU, and you'll get barely-better-than-CPU performance.

The GeForce 920M only has 2GB of RAM, so only the tiniest of models would fit entirely on it. Actually, NONE of the existing models in the model library would fit entirely.

I just tried "dolphin-phi" which is 1.6GB but it consumes 2.5GB of GPU memory on my machine.

I did a quick search on HuggingFace for a sub-1GB GGUF model that can be imported, but found nothing.

Even if the entire model fits in the GPU, there is still some activity on the CPU. I just tested a specific LLM on Ollama, and while it is generating, the CPU usage is 100% and the GPU usage is 100%; but my machine has 6 cores, so only 1 CPU core is being used. I believe this is copying data to and from GPU memory, not actually doing inference on the CPU.

I have observed that when doing pure CPU inference, ALL of the CPU cores are used (e.g. I see 600% CPU), so in a mixed setup where some layers are on the GPU and others on the CPU, if Ollama uses all available CPU cores (it does not when all layers are on the GPU), there might still be some benefit to offloading some layers to the GPU.
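For experimenting with that split, Ollama exposes a num_gpu option that caps how many layers are offloaded; a hedged sketch via the REST API (the model and layer count here are arbitrary):

```sh
# Offload only 20 layers to the GPU and leave the rest on the CPU.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": { "num_gpu": 20 }
}'
```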

@felipecock commented on GitHub (Feb 7, 2024):

Thank you, @orlyandico, for your reply.

@j-d-salinger commented on GitHub (Feb 29, 2024):

> @nejib1 if you apply the changes of my PR as a [patch](https://patch-diff.githubusercontent.com/raw/ollama/ollama/pull/2233.patch) to the repo and build from source, it will run on a K80 GPU. Instructions on building from source are [here](https://github.com/ollama/ollama/blob/main/docs/development.md).
>
> Given our concern that this might actually result in a performance regression rather than an improvement for users, we're going to hold off merging this until we get more performance data.

I'm willing to test the K80... I have 2 of them on one machine (= 4 GPUs total, since the K80 is two 12 GB GPUs stuck together), and 128 GB of RAM on the machine.

I'm trying it on the mistral, llama2, and dolphin-phi models. I've successfully merged your patch and built it, but I can't quite tell if it is using the GPU; at least `nvidia-smi` doesn't show that it's using it. It does detect it in the logs (INFO: Cuda Compute 3.7 detected), but the CPU is using 1,500% in `top`...

Correct me if I'm wrong, but even "run"-ing a model can benefit from the GPU? Last time I worked with AI, I only ran (evaluated) models on the CPU and would only use the GPU for training. Right now I'm only testing "run" (chat), and it is showing CPU-only.

@nejib1 commented on GitHub (Feb 29, 2024):

> Trying it on the mistral model... I've successfully merged your patch and built it, but I can't quite tell if it is using the GPU? At least `nvidia-smi` doesn't show that it's using it. It does detect it in the logs: `INFO: Cuda Compute 3.7 detected`. But CPU is using 1,500% in `top`

This Linux command refreshes `nvidia-smi` continuously, so you can run something and check the GPU utilization:

`watch -n 1 nvidia-smi`
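For a more compact view, `nvidia-smi` can also poll selected fields itself; these are standard query options:

```
# Print per-device GPU and memory utilization once per second
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv -l 1
```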

@j-d-salinger commented on GitHub (Feb 29, 2024):

Yes, it says "No running process found". All are at 0% utilization, 0 MB of memory used

@orlyandico commented on GitHub (Feb 29, 2024):

If CPU usage is 1500% and there is no GPU memory usage, then it's not running on the GPU.

@j-d-salinger commented on GitHub (Feb 29, 2024):

Yes, how do I fix that?

@orlyandico commented on GitHub (Feb 29, 2024):

If you could post the logs here (e.g., similar to my original post at the very top), that would be useful.
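For example, capturing a debug-level server log to a file could be as simple as this (the same `OLLAMA_DEBUG` switch that appears in the logs below):

```
OLLAMA_DEBUG=1 ./ollama serve 2>&1 | tee ollama.log
```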

@j-d-salinger commented on GitHub (Mar 1, 2024):

Here is some startup info, before chatting:

```
gpus@GGGGG:~/ollama-k80$ OLLAMA_DEBUG=1 ./ollama serve
time=2024-02-29T20:44:04.140-05:00 level=INFO source=images.go:710 msg="total blobs: 17"
time=2024-02-29T20:44:04.142-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/jmorganca/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/jmorganca/ollama/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/jmorganca/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET    /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
time=2024-02-29T20:44:04.142-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-02-29T20:44:04.142-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-29T20:44:04.185-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-02-29T20:44:04.185-05:00 level=DEBUG source=payload_common.go:147 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-02-29T20:44:04.185-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-29T20:44:04.185-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-29T20:44:04.185-05:00 level=DEBUG source=gpu.go:283 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /home/gpus/ollama-k80/libnvidia-ml.so*]"
time=2024-02-29T20:44:04.187-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.470.223.02]"
wiring nvidia management library functions in /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.470.223.02
dlsym: nvmlInit_v2
dlsym: nvmlShutdown
dlsym: nvmlDeviceGetHandleByIndex
dlsym: nvmlDeviceGetMemoryInfo
dlsym: nvmlDeviceGetCount_v2
dlsym: nvmlDeviceGetCudaComputeCapability
dlsym: nvmlSystemGetDriverVersion
dlsym: nvmlDeviceGetName
dlsym: nvmlDeviceGetSerial
dlsym: nvmlDeviceGetVbiosVersion
dlsym: nvmlDeviceGetBoardPartNumber
dlsym: nvmlDeviceGetBrand
CUDA driver version: 470.223.02
time=2024-02-29T20:44:04.190-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-02-29T20:44:04.190-05:00 level=INFO source=cpu_common.go:15 msg="CPU has AVX"
[0] CUDA device name: Tesla K80
[0] CUDA part number: 900-22080-0000-000
[0] CUDA S/N: 0325216145858
[0] CUDA vbios version: 80.21.1F.00.01
[0] CUDA brand: 2
[0] CUDA totalMem 11997020160
[0] CUDA usedMem 11997020160
[1] CUDA device name: Tesla K80
[1] CUDA part number: 900-22080-0000-000
[1] CUDA S/N: 0325216145858
[1] CUDA vbios version: 80.21.1F.00.02
[1] CUDA brand: 2
[1] CUDA totalMem 11997020160
[1] CUDA usedMem 11997020160
[2] CUDA device name: Tesla K80
[2] CUDA part number: 900-22080-0000-000
[2] CUDA S/N: 0320117146457
[2] CUDA vbios version: 80.21.1F.00.01
[2] CUDA brand: 2
[2] CUDA totalMem 11997020160
[2] CUDA usedMem 11997020160
[3] CUDA device name: Tesla K80
[3] CUDA part number: 900-22080-0000-000
[3] CUDA S/N: 0320117146457
[3] CUDA vbios version: 80.21.1F.00.02
[3] CUDA brand: 2
[3] CUDA totalMem 11997020160
[3] CUDA usedMem 11997020160
time=2024-02-29T20:44:04.216-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 3.7"
time=2024-02-29T20:44:04.216-05:00 level=DEBUG source=gpu.go:254 msg="cuda detected 4 devices with 41188M available memory"
```

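Worth noting in the log above: the `Dynamic LLM libraries [cpu cpu_avx cpu_avx2]` line shows that no CUDA runner was compiled into this build, so only CPU backends can ever be loaded, which matches the 1,500% CPU and idle GPUs. The log also mentions the `OLLAMA_LLM_LIBRARY` override, which can force a specific runner, but only one that actually appears in that list:

```
# Only has an effect if a cuda entry shows up in "Dynamic LLM libraries"
OLLAMA_DEBUG=1 OLLAMA_LLM_LIBRARY=cuda_v11 ./ollama serve
```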

@j-d-salinger commented on GitHub (Mar 1, 2024):

Here are some logs from when I tried to chat. Let me know if you need more; I had to download the model in between.

```
time=2024-02-29T20:45:27.180-05:00 level=INFO source=images.go:710 msg="total blobs: 17"
time=2024-02-29T20:45:27.181-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/jmorganca/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/jmorganca/ollama/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/jmorganca/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET    /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
time=2024-02-29T20:45:27.182-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-02-29T20:45:27.182-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-29T20:45:27.224-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx2 cpu cpu_avx]"
time=2024-02-29T20:45:27.224-05:00 level=DEBUG source=payload_common.go:147 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-02-29T20:45:27.224-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-29T20:45:27.225-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-29T20:45:27.225-05:00 level=DEBUG source=gpu.go:283 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /home/gpus/ollama-k80/libnvidia-ml.so*]"
time=2024-02-29T20:45:27.226-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.470.223.02]"
wiring nvidia management library functions in /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.470.223.02
dlsym: nvmlInit_v2
dlsym: nvmlShutdown
dlsym: nvmlDeviceGetHandleByIndex
dlsym: nvmlDeviceGetMemoryInfo
dlsym: nvmlDeviceGetCount_v2
dlsym: nvmlDeviceGetCudaComputeCapability
dlsym: nvmlSystemGetDriverVersion
dlsym: nvmlDeviceGetName
dlsym: nvmlDeviceGetSerial
dlsym: nvmlDeviceGetVbiosVersion
dlsym: nvmlDeviceGetBoardPartNumber
dlsym: nvmlDeviceGetBrand
CUDA driver version: 470.223.02
time=2024-02-29T20:45:27.229-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-02-29T20:45:27.229-05:00 level=INFO source=cpu_common.go:15 msg="CPU has AVX"
[0] CUDA device name: Tesla K80
[0] CUDA part number: 900-22080-0000-000
[0] CUDA S/N: 0325216145858
[0] CUDA vbios version: 80.21.1F.00.01
[0] CUDA brand: 2
[0] CUDA totalMem 11997020160
[0] CUDA usedMem 11997020160
[1] CUDA device name: Tesla K80
[1] CUDA part number: 900-22080-0000-000
[1] CUDA S/N: 0325216145858
[1] CUDA vbios version: 80.21.1F.00.02
[1] CUDA brand: 2
[1] CUDA totalMem 11997020160
[1] CUDA usedMem 11997020160
[2] CUDA device name: Tesla K80
[2] CUDA part number: 900-22080-0000-000
[2] CUDA S/N: 0320117146457
[2] CUDA vbios version: 80.21.1F.00.01
[2] CUDA brand: 2
[2] CUDA totalMem 11997020160
[2] CUDA usedMem 11997020160
[3] CUDA device name: Tesla K80
[3] CUDA part number: 900-22080-0000-000
[3] CUDA S/N: 0320117146457
[3] CUDA vbios version: 80.21.1F.00.02
[3] CUDA brand: 2
[3] CUDA totalMem 11997020160
[3] CUDA usedMem 11997020160
time=2024-02-29T20:45:27.255-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 3.7"
time=2024-02-29T20:45:27.255-05:00 level=DEBUG source=gpu.go:254 msg="cuda detected 4 devices with 41188M available memory"
[GIN] 2024/02/29 - 20:45:40 | 200 |      50.824µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/02/29 - 20:45:40 | 404 |     186.008µs |       127.0.0.1 | POST     "/api/show"
time=2024-02-29T20:45:42.514-05:00 level=INFO source=download.go:136 msg="downloading c1864a5eb193 in 17 100 MB part(s)"
time=2024-02-29T20:45:53.141-05:00 level=INFO source=download.go:178 msg="c1864a5eb193 part 8 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-02-29T20:46:07.956-05:00 level=INFO source=download.go:178 msg="c1864a5eb193 part 10 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-02-29T20:46:39.546-05:00 level=INFO source=download.go:136 msg="downloading 097a36493f71 in 1 8.4 KB part(s)"
time=2024-02-29T20:46:42.682-05:00 level=INFO source=download.go:136 msg="downloading 109037bec39c in 1 136 B part(s)"
time=2024-02-29T20:46:44.721-05:00 level=INFO source=download.go:136 msg="downloading 22a838ceb7fb in 1 84 B part(s)"
time=2024-02-29T20:46:47.900-05:00 level=INFO source=download.go:136 msg="downloading 887433b89a90 in 1 483 B part(s)"
[GIN] 2024/02/29 - 20:47:00 | 200 |         1m19s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/02/29 - 20:47:00 | 200 |    1.759027ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/02/29 - 20:47:00 | 200 |    1.085527ms |       127.0.0.1 | POST     "/api/show"
time=2024-02-29T20:47:02.540-05:00 level=INFO source=cpu_common.go:15 msg="CPU has AVX"
[0] CUDA device name: Tesla K80
[0] CUDA part number: 900-22080-0000-000
[0] CUDA S/N: 0325216145858
[0] CUDA vbios version: 80.21.1F.00.01
[0] CUDA brand: 2
[0] CUDA totalMem 11997020160
[0] CUDA usedMem 11997020160
[1] CUDA device name: Tesla K80
[1] CUDA part number: 900-22080-0000-000
[1] CUDA S/N: 0325216145858
[1] CUDA vbios version: 80.21.1F.00.02
[1] CUDA brand: 2
[1] CUDA totalMem 11997020160
[1] CUDA usedMem 11997020160
[2] CUDA device name: Tesla K80
[2] CUDA part number: 900-22080-0000-000
[2] CUDA S/N: 0320117146457
[2] CUDA vbios version: 80.21.1F.00.01
[2] CUDA brand: 2
[2] CUDA totalMem 11997020160
[2] CUDA usedMem 11997020160
[3] CUDA device name: Tesla K80
[3] CUDA part number: 900-22080-0000-000
[3] CUDA S/N: 0320117146457
[3] CUDA vbios version: 80.21.1F.00.02
[3] CUDA brand: 2
[3] CUDA totalMem 11997020160
[3] CUDA usedMem 11997020160
time=2024-02-29T20:47:02.542-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 3.7"
time=2024-02-29T20:47:02.542-05:00 level=DEBUG source=gpu.go:254 msg="cuda detected 4 devices with 41188M available memory"
time=2024-02-29T20:47:02.542-05:00 level=INFO source=cpu_common.go:15 msg="CPU has AVX"
[0] CUDA device name: Tesla K80
[0] CUDA part number: 900-22080-0000-000
[0] CUDA S/N: 0325216145858
[0] CUDA vbios version: 80.21.1F.00.01
[0] CUDA brand: 2
[0] CUDA totalMem 11997020160
[0] CUDA usedMem 11997020160
[1] CUDA device name: Tesla K80
[1] CUDA part number: 900-22080-0000-000
[1] CUDA S/N: 0325216145858
[1] CUDA vbios version: 80.21.1F.00.02
[1] CUDA brand: 2
[1] CUDA totalMem 11997020160
[1] CUDA usedMem 11997020160
[2] CUDA device name: Tesla K80
[2] CUDA part number: 900-22080-0000-000
[2] CUDA S/N: 0320117146457
[2] CUDA vbios version: 80.21.1F.00.01
[2] CUDA brand: 2
[2] CUDA totalMem 11997020160
[2] CUDA usedMem 11997020160
[3] CUDA device name: Tesla K80
[3] CUDA part number: 900-22080-0000-000
[3] CUDA S/N: 0320117146457
[3] CUDA vbios version: 80.21.1F.00.02
[3] CUDA brand: 2
[3] CUDA totalMem 11997020160
[3] CUDA usedMem 11997020160
time=2024-02-29T20:47:02.544-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 3.7"
time=2024-02-29T20:47:02.544-05:00 level=INFO source=cpu_common.go:15 msg="CPU has AVX"
time=2024-02-29T20:47:02.544-05:00 level=DEBUG source=payload_common.go:93 msg="ordered list of LLM libraries to try [/tmp/ollama2256939918/cpu_avx/libext_server.so]"
loading library /tmp/ollama2256939918/cpu_avx/libext_server.so
time=2024-02-29T20:47:02.566-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama2256939918/cpu_avx/libext_server.so"
time=2024-02-29T20:47:02.566-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
[1709257622] system info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
llama_model_loader: loaded meta data with 21 key-value pairs and 164 tensors from /home/gpus/.ollama/models/blobs/sha256:c1864a5eb19305c40519da12cc543519e48a0697ecd30e15d5ac228644957d12 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-2b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 18
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 8
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 1
llama_model_loader: - kv   8:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv   9:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  10:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  14:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,256128]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,256128]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,256128]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                          general.file_type u32              = 2
llama_model_loader: - type  f32:   37 tensors
llama_model_loader: - type q4_0:  126 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 1
llm_load_print_meta: n_layer          = 18
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 2.51 B
llm_load_print_meta: model size       = 1.56 GiB (5.34 BPW)
llm_load_print_meta: general.name     = gemma-2b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.06 MiB
llm_load_tensors:        CPU buffer size =  1594.93 MiB
.....................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    36.00 MiB
llama_new_context_with_model: KV self size  =   36.00 MiB, K (f16):   18.00 MiB, V (f16):   18.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.02 MiB
llama_new_context_with_model:        CPU compute buffer size =   504.25 MiB
llama_new_context_with_model: graph splits (measure): 1
[1709257623] warming up the model with an empty run
{"function":"initialize","level":"INFO","line":494,"msg":"initializing slots","n_slots":1,"tid":"139936973207296","timestamp":1709257623}
{"function":"initialize","level":"INFO","line":503,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"139936973207296","timestamp":1709257623}
time=2024-02-29T20:47:03.431-05:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
time=2024-02-29T20:47:03.431-05:00 level=DEBUG source=prompt.go:170 msg="prompt now fits in context window" required=1 window=2048
[1709257623] llama server main loop starting
{"function":"update_slots","level":"INFO","line":1618,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"139931251394304","timestamp":1709257623}
[GIN] 2024/02/29 - 20:47:03 | 200 |  3.063349653s |       127.0.0.1 | POST     "/api/chat"
time=2024-02-29T20:47:19.524-05:00 level=DEBUG source=prompt.go:170 msg="prompt now fits in context window" required=24 window=2048
time=2024-02-29T20:47:19.524-05:00 level=DEBUG source=routes.go:1225 msg="chat handler" prompt="<start_of_turn>user\nCan you please generate a question that my grandmother would laugh at?<end_of_turn>\n<start_of_turn>model\n" images=0
{"function":"launch_slot_with_data","level":"INFO","line":884,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"139931251394304","timestamp":1709257639}
{"function":"update_slots","level":"INFO","line":1844,"msg":"slot progression","n_past":0,"num_prompt_tokens_processed":22,"slot_id":0,"task_id":0,"tid":"139931251394304","timestamp":1709257639}
{"function":"update_slots","level":"INFO","line":1869,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"139931251394304","timestamp":1709257639}
{"function":"print_timings","level":"INFO","line":316,"msg":"prompt eval time     =    1036.84 ms /    22 tokens (   47.13 ms per token,    21.22 tokens per second)","n_tokens_second":21.218235337234916,"num_prompt_tokens_processed":22,"slot_id":0,"t_prompt_processing":1036.844,"t_token":47.12927272727273,"task_id":0,"tid":"139931251394304","timestamp":1709257642}
{"function":"print_timings","level":"INFO","line":330,"msg":"generation eval time =    2193.01 ms /    32 runs   (   68.53 ms per token,    14.59 tokens per second)","n_decoded":32,"n_tokens_second":14.59184334196988,"slot_id":0,"t_token":68.5314375,"t_token_generation":2193.006,"task_id":0,"tid":"139931251394304","timestamp":1709257642}
{"function":"print_timings","level":"INFO","line":340,"msg":"          total time =    3229.85 ms","slot_id":0,"t_prompt_processing":1036.844,"t_token_generation":2193.006,"t_total":3229.85,"task_id":0,"tid":"139931251394304","timestamp":1709257642}
{"function":"update_slots","level":"INFO","line":1680,"msg":"slot released","n_cache_tokens":54,"n_ctx":2048,"n_past":53,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"139931251394304","timestamp":1709257642,"truncated":false}
[1709257642] next result cancel on stop
[1709257642] next result removing waiting task ID: 0
[GIN] 2024/02/29 - 20:47:22 | 200 |  3.232788536s |       127.0.0.1 | POST     "/api/chat"
time=2024-02-29T20:47:27.463-05:00 level=DEBUG source=prompt.go:170 msg="prompt now fits in context window" required=71 window=2048
time=2024-02-29T20:47:27.463-05:00 level=DEBUG source=routes.go:1225 msg="chat handler" prompt="<start_of_turn>user\nCan you please generate a question that my grandmother would laugh at?<end_of_turn>\n<start_of_turn>model\nSure, here's a question that your grandmother might laugh at:\n\n\"What do you call a fish with no eyes?\"\n\nHope this helps!<end_of_turn>\n<start_of_turn>user\nWhat's the answer?<end_of_turn>\n<start_of_turn>model\n" images=0
{"function":"launch_slot_with_data","level":"INFO","line":884,"msg":"slot is processing task","slot_id":0,"task_id":35,"tid":"139931251394304","timestamp":1709257647}
{"function":"update_slots","level":"INFO","line":1844,"msg":"slot progression","n_past":53,"num_prompt_tokens_processed":16,"slot_id":0,"task_id":35,"tid":"139931251394304","timestamp":1709257647}
{"function":"update_slots","level":"INFO","line":1869,"msg":"kv cache rm [p0, end)","p0":53,"slot_id":0,"task_id":35,"tid":"139931251394304","timestamp":1709257647}
{"function":"print_timings","level":"INFO","line":316,"msg":"prompt eval time     =     763.45 ms /    16 tokens (   47.72 ms per token,    20.96 tokens per second)","n_tokens_second":20.957577932718316,"num_prompt_tokens_processed":16,"slot_id":0,"t_prompt_processing":763.447,"t_token":47.7154375,"task_id":35,"tid":"139931251394304","timestamp":1709257649}
{"function":"print_timings","level":"INFO","line":330,"msg":"generation eval time =    1632.73 ms /    24 runs   (   68.03 ms per token,    14.70 tokens per second)","n_decoded":24,"n_tokens_second":14.699307295143715,"slot_id":0,"t_token":68.03041666666667,"t_token_generation":1632.73,"task_id":35,"tid":"139931251394304","timestamp":1709257649}
{"function":"print_timings","level":"INFO","line":340,"msg":"          total time =    2396.18 ms","slot_id":0,"t_prompt_processing":763.447,"t_token_generation":1632.73,"t_total":2396.177,"task_id":35,"tid":"139931251394304","timestamp":1709257649}
{"function":"update_slots","level":"INFO","line":1680,"msg":"slot released","n_cache_tokens":93,"n_ctx":2048,"n_past":92,"n_system_tokens":0,"slot_id":0,"task_id":35,"tid":"139931251394304","timestamp":1709257649,"truncated":false}
[1709257649] next result cancel on stop
[1709257649] next result removing waiting task ID: 35
[GIN] 2024/02/29 - 20:47:29 | 200 |  2.401455703s |       127.0.0.1 | POST     "/api/chat"
```

@j-d-salinger commented on GitHub (Mar 2, 2024):

That issue may have been caused by pulling the repo from master and building from that, rather than checking out the **latest release** (v0.1.27 as of today), applying the patch, and testing on the K80.

Unfortunately I only worked this out after I had swapped the K80s out for a P40. If I have time I'll go back and test the K80s with the latest release.
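Building against the tagged release instead of master would look roughly like this (tag name from the comment above; `2233.patch` is just whatever name you saved the PR diff under):

```
git fetch --tags
git checkout v0.1.27
git apply 2233.patch
go generate ./...
go build .
```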

@dhiltgen commented on GitHub (Mar 25, 2024):

I'm not sure when I'll have a chance to get back to this one, so this would be a great community contribution if someone's up for it.

The rough design is to modify the linux and windows gen_* scripts (https://github.com/ollama/ollama/tree/main/llm/generate) so that by setting some env var before calling go generate ./... we'd add the CMAKE_CUDA_ARCHITECTURES for 35 and 37. Then we'd need to refactor the CudaComputeMin in gpu.go (https://github.com/ollama/ollama/blob/main/gpu/gpu.go) so that it's easy to override at build time. (Look at how we set the version.go (https://github.com/ollama/ollama/blob/main/version/version.go) setting in the build script (https://github.com/ollama/ollama/blob/main/scripts/build_linux.sh#L6).) Then doc it all so it's easy for folks with these older cards to install an older CUDA version that still supports 3.5 and build from source. It might look something like this:

```
OLLAMA_CUSTOM_CUDA_ARCH="35;37" go generate ./...
go build '-ldflags=-w -s "-X=github.com/ollama/ollama/gpu.CudaMinVersion=3.5"' .
```
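For anyone picking this up, a rough sketch of what the gen-script half of that design might look like; CMAKE_DEFS and the exact wiring are assumptions about the script's internals, not verified against the repo:

```
# Hypothetical addition to llm/generate/gen_linux.sh (variable names assumed):
# honor an opt-in env var that overrides the CUDA architecture list at generate time.
if [ -n "${OLLAMA_CUSTOM_CUDA_ARCH}" ]; then
    CMAKE_DEFS="${CMAKE_DEFS} -DCMAKE_CUDA_ARCHITECTURES=${OLLAMA_CUSTOM_CUDA_ARCH}"
fi
```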

@langstonmeister commented on GitHub (Aug 4, 2024):

I just built the files, like here:

```
OLLAMA_CUSTOM_CUDA_ARCH="35;37" go generate ./...
go build '-ldflags=-w -s "-X=github.com/ollama/ollama/gpu.CudaMinVersion=3.5"' .
```

and it worked, but I am still not getting any activity on the GPU, just the CPU. I am going to try again from the beginning with the latest from git and I'll report back.

I am trying to build for a K40. I have 2 that I would like to use, but so far no luck.

edit: I recompiled it all from the main source, and I'm getting the same errors.

```
time=2024-08-05T00:29:10.091Z level=INFO source=routes.go:1155 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-05T00:29:10.091Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama585910813/runners
time=2024-08-05T00:29:13.590Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-08-05T00:29:13.590Z level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-05T00:29:15.345Z level=INFO source=gpu.go:265 msg="[0] CUDA GPU is too old. Compute Capability detected: 3.5"
```

Even though compute capability 3.5 was supposed to be supported.


@langstonmeister commented on GitHub (Aug 5, 2024):

Okay, I was able to get something to work! I'm not there yet, but hopefully someone smarter than me can fill in some gaps.

It turns out that CMAKE_CUDA_ARCHITECTURES is what actually passes the compute versions to the compiler. What I was able to get working was this command:

```
OLLAMA_CUSTOM_CUDA_ARCH="35;37" CMAKE_CUDA_ARCHITECTURES="35;37" go generate ./...
```

I'm still having the issue where it tells me that my GPU is too old, but it is showing about 64MB in VRAM.

One annoying thing is that I keep having to install the cuda-toolkit each time I want to compile, then reinstall utils-470 when I want to try running it. It would seem that this generation did not have nvidia-smi available alongside the cuda-toolkit. I could see users being frustrated by this.


@dhiltgen commented on GitHub (Aug 6, 2024):

@langstonmeister check out #2233 for some minor changes required to get things working on CC 3.5 and 3.7 GPUs.


@orlyandico commented on GitHub (Aug 6, 2024):

Possibly extremely dumb question/observation, sorry if I missed something...

I saw an error on the GitHub page:

```
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.183.06: nvml vram init failure: 9
```

That seems to say that a driver later than NVIDIA 470 and a matching CUDA toolkit have been installed.

As far as I can recall, the CUDA compute 3.5/3.7 cards only support the 470 driver and CUDA toolkit 11.4 (nothing later).

This becomes a further challenge because CUDA toolkit 11.4 is not supported or available on anything later than Ubuntu 20.04.


@langstonmeister commented on GitHub (Aug 10, 2024):

Yeah, it has to be 11.4. It will not compile with an older version of the CUDA toolkit. I was able to install CUDA 11.4 and driver 470 on Ubuntu 24.04 and it's working so far. I needed to use the driver provided by Ubuntu and install just the CUDA stuff from NVIDIA.

My previous logs were from a system that I had messed up pretty badly... I ended up reinstalling the OS and starting from scratch with just the Ubuntu driver and CUDA 11.4, and it works.


@ZeroZen270 commented on GitHub (Aug 17, 2024):

> Yeah, it has to be 11.4. It will not compile with an older version of the CUDA toolkit. I was able to install CUDA 11.4 and driver 470 on Ubuntu 24.04 and it's working so far. I needed to use the driver provided by Ubuntu and install just the CUDA stuff from NVIDIA.
>
> My previous logs were from a system that I had messed up pretty badly... I ended up reinstalling the OS and starting from scratch with just the Ubuntu driver and CUDA 11.4, and it works.

So, to be clear: you got Ollama working with a K80 after a rebuild? If so, what steps would one take on Ubuntu Noble? Thanks in advance.


@langstonmeister commented on GitHub (Aug 18, 2024):

Technically I got it working on 2 K40c cards, but I would assume that it should also work for the K80.

1. Install the driver from Ubuntu (I'm running this on a headless server)

```
sudo ubuntu-drivers install nvidia-driver-470-server
```

2. Download and install CUDA 11.4 from the NVIDIA site
   • Be sure that you do NOT try to install the driver from here; it will fail. The override flag tells it to skip the check of the gcc version, which I had to do each time.

```
wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda_11.4.0_470.42.01_linux.run
sudo sh cuda_11.4.0_470.42.01_linux.run --override
```

3. Download Ollama (check the instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md for the Linux manual install)

4. Run it!

```
CMAKE_CUDA_ARCHITECTURES="35;37" go generate ./...
```

I think I had to make one more edit, but I can't remember where it was now. Or else it was when I installed open-webui on top of this. I'll try to update my repos with all my stuff in the next few days, so hopefully you could just clone my repo and get it all done.

I hope that helps. It took me a while to get through it all, but I feel like it was worth it. This system is much faster than my other server with a GTX 1660, and I can run some pretty huge models across the 2 cards with 24GB of VRAM in total.


@langstonmeister commented on GitHub (Aug 18, 2024):

I did have to make one more edit! In the gpu/gpu.go file (e.g. `sudo nano gpu/gpu.go`), change the line about CUDA compute to:

```
var CudaComputeMin = [2]C.int{3, 0}
```

so that it will not throw errors.
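If you'd rather script that edit than open an editor, something like the following should work; the original value {5, 0} is an assumption about what the line read at the time:

```
# Hypothetical one-liner; assumes gpu/gpu.go currently reads {5, 0} on that line.
sed -i 's/CudaComputeMin = \[2\]C.int{5, 0}/CudaComputeMin = [2]C.int{3, 0}/' gpu/gpu.go
```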


@orlyandico commented on GitHub (Aug 18, 2024):

To clarify: the above steps work on Ubuntu 24?

I had some serious problems even getting the 11.4 CUDA software to install on 22.04 (given that it's only officially supported up to 20.04).


@langstonmeister commented on GitHub (Aug 18, 2024):

I did not have any issues with it. The only thing is making sure not to install the driver that comes packaged with the CUDA tools. Keep the Ubuntu driver, and install just the CUDA parts from that package.
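For reference, the NVIDIA runfile installer can also be told to skip its bundled driver non-interactively; a minimal sketch using the installer's standard flags:

```
# Install only the CUDA 11.4 toolkit from the runfile, keeping Ubuntu's 470 driver.
# --toolkit selects just the toolkit, --silent skips the interactive menu,
# --override bypasses the gcc version check mentioned earlier in the thread.
sudo sh cuda_11.4.0_470.42.01_linux.run --toolkit --silent --override
```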


@simsi-andy commented on GitHub (Sep 17, 2024):

Any advice for Windows users? (Besides switching to Ubuntu) 😉😬


@bones0 commented on GitHub (Dec 10, 2024):

Hi,
there is a fork at https://github.com/austinksmith/ollama37, but it seems to be a bit behind the current Ollama version.

Personally I ran into too many problems with my K40/K80. Some projects flatly refuse the NVIDIA driver 470 as "too old", etc. I'd have to build a "legacy rig" just for the K (and soon also the M) cards, which is not at the top of my list.


@dhiltgen commented on GitHub (Dec 10, 2024):

For instructions on building from source for these older GPUs, see https://github.com/ollama/ollama/blob/main/docs/development.md#older-linux-cuda-nvidia


@ShadowGallery93 commented on GitHub (Jan 18, 2025):

This is awesome!! The back-compatibility to compute 3.7 is clutch!
My tips:

  • K80 driver support stops somewhere above kernel 5.15.0, so you need a distro that supports it. Tested on Ubuntu 20.04.
  • Do not try the script or blanket install NVIDIA offers for K80/legacy cards, especially if you have a newer GPU installed.
  • instead: sudo add-apt-repository ppa:graphics-drivers/ppa
  • once the repo is added: sudo apt update && sudo apt install nvidia-driver-470 if using the K80
  • once the driver is installed: sudo apt install cuda-toolkit-11-4 -y
  • reboot
  • sudo apt install gcc g++ golang
  • run go version; if the version is below 1.22, manually upgrade via the golang web download
  • download the ollama source via zip or git clone
  • extract and navigate to the /ollama/ollama/make subdir
  • modify cuda_legacy_v11.Makefile: where it says CUDA_ARCHITECTURES=..., delete the numbers following the = and replace them with 35;37;50;52 (see the one-liner sketch after this list)
  • save, exit the file, and navigate back to the /ollama/ollama dir
  • run make -j 5 CUDA_ARCHITECTURES="35;37;50;52" EXTRA_GOLDFLAGS="\"-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3\" \"-X=github.com/ollama/ollama/discover.CudaComputeMinorMin=5\""
  • run ./ollama
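Condensing the Makefile edit above into a single command; note the file name differs across checkouts (a later comment below finds it as make/Makefile.cuda_v11 rather than cuda_legacy_v11.Makefile), so adjust accordingly:

```
# Hypothetical one-liner for the Makefile edit; use whichever file name your checkout has.
sed -i 's/^CUDA_ARCHITECTURES?=.*/CUDA_ARCHITECTURES?=35;37;50;52/' make/cuda_legacy_v11.Makefile
```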

@wonka929 commented on GitHub (Jan 23, 2025):

Here I am, for the same issue.
I'm on Manjaro with the latest stable release.
NVIDIA K3100M on the device, 4GB of RAM.
The last "kind of accepted" driver is 470, but I'm not totally sure. BTW, I have it installed and it's working.
The compatible CUDA for 470 is 11.4, as seen from nvidia-smi.
Kernel 6.12.

Manjaro does not have cuda11 in its repos, so I downloaded the packages from https://archive.archlinux.org/packages/c/ (gcc10, cuda 11.4, cuda-tools 11.4)

  • downloaded ollama with curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz in ~/Downloads
  • extracted
  • cd ./ollama/make
  • no cuda_legacy_v11.Makefile found
  • found Makefile.cuda_v11
  • edited so that CUDA_ARCHITECTURES?= 35;37;50;52
  • edited ./discover/gpu.go with var ( CudaComputeMajorMin = "3" CudaComputeMinorMin = "0" )
  • cd ..
  • make or make -j CUDA_ARCHITECTURES="35;37;50;52"

./ollama serve

```
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-01-23T21:44:25.154+01:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11434 (version 0.5.7-4-gca2f984-dirty)"
time=2025-01-23T21:44:25.155+01:00 level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cpu]"
time=2025-01-23T21:44:25.155+01:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-23T21:44:25.226+01:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-c1ad423e-7036-3411-b1c1-34aef7900dc7 library=cuda variant=v11 compute=3.0 driver=11.4 name="Quadro K3100M" total="3.9 GiB" available="3.6 GiB"
```

but still not working.

Any idea? @ShadowGallery93


@quq233 commented on GitHub (Jan 24, 2025):

> Here I am, for the same issue. I'm on Manjaro with the latest stable release. NVIDIA K3100M on the device, 4GB of RAM. The last "kind of accepted" driver is 470, but I'm not totally sure. BTW, I have it installed and it's working. The compatible CUDA for 470 is 11.4, as seen from nvidia-smi. Kernel 6.12.
>
> Manjaro does not have cuda11 in its repos, so I downloaded the packages from https://archive.archlinux.org/packages/c/ (gcc10, cuda 11.4, cuda-tools 11.4)
>
>   • downloaded ollama with curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz in ~/Downloads
>   • extracted
>   • cd ./ollama/make
>   • no cuda_legacy_v11.Makefile found
>   • found Makefile.cuda_v11
>   • edited so that CUDA_ARCHITECTURES?= 35;37;50;52
>   • edited ./discover/gpu.go with var ( CudaComputeMajorMin = "3" CudaComputeMinorMin = "0" )
>   • cd ..
>   • make or make -j CUDA_ARCHITECTURES="35;37;50;52"
>
> ./ollama serve
>
> > [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
> > time=2025-01-23T21:44:25.154+01:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11434 (version 0.5.7-4-gca2f984-dirty)"
> > time=2025-01-23T21:44:25.155+01:00 level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cpu]"
> > time=2025-01-23T21:44:25.155+01:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
> > time=2025-01-23T21:44:25.226+01:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-c1ad423e-7036-3411-b1c1-34aef7900dc7 library=cuda variant=v11 compute=3.0 driver=11.4 name="Quadro K3100M" total="3.9 GiB" available="3.6 GiB"
>
> but still not working.
>
> Any idea? @ShadowGallery93

This worked for me:

```
sudo make -j 8 CUDA_11_PATH=/usr/local/cuda-11.4 CUSTOM_CPU_FLAGS="" CUDA_ARCHITECTURES="35;37;50;52" EXTRA_GOLDFLAGS="\"-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3\" \"-X=github.com/ollama/ollama/discover.CudaComputeMinorMin=5\"" PATH="/path/to/your/go/bin:$PATH"
```


@wonka929 commented on GitHub (Jan 24, 2025):

@quq233 still having issues, both with CC not set (so gcc-14) and with CC set to gcc-10:

```
$ make CC=gcc-10 CUDA_11_PATH=/opt/cuda CUSTOM_CPU_FLAGS="" CUDA_ARCHITECTURES="35;37;50;52" EXTRA_GOLDFLAGS="\"-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3\" \"-X=github.com/ollama/ollama/discover.CudaComputeMinorMin=5\"" PATH="/usr/bin/go:$PATH"

make[1]: Nothing to be done for 'cpu'.
/opt/cuda/bin/nvcc -c -Xcompiler -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=c++17 -Xcompiler "" -t2 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_USE_CUDA=1 -DGGML_SHARED=1 -DGGML_BACKEND_SHARED=1 -DGGML_BUILD=1 -DGGML_BACKEND_BUILD=1 -DGGML_USE_LLAMAFILE -DK_QUANTS_PER_ITERATION=2 -DNDEBUG -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Wno-deprecated-gpu-targets --forward-unknown-to-host-compiler -use_fast_math -I./llama/ -O3 --generate-code=arch=compute_35,code=[compute_35,sm_35] --generate-code=arch=compute_37,code=[compute_37,sm_37] --generate-code=arch=compute_50,code=[compute_50,sm_50] --generate-code=arch=compute_52,code=[compute_52,sm_52] -DGGML_CUDA_USE_GRAPHS=1 -o llama/build/linux-amd64/llama/ggml-cuda/ggml-cuda.cuda_v11.o llama/ggml-cuda/ggml-cuda.cu
gcc: internal compiler error: Segmentation fault signal terminated program cc1plus
Please submit a full bug report,
with preprocessed source if appropriate.
See https://bugs.archlinux.org/ for instructions.
make[1]: *** [make/gpu.make:53: llama/build/linux-amd64/llama/ggml-cuda/ggml-cuda.cuda_v11.o] Error 255
make: *** [Makefile:48: cuda_v11] Error 2
```


@wonka929 commented on GitHub (Jan 27, 2025):

@ShadowGallery93 I managed to do it.

Final command:

```
make CUDA_11_PATH=/opt/cuda CUDA_ARCHITECTURES="30;35;37;50;52" EXTRA_GOLDFLAGS="\"-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3\" \"-X=github.com/ollama/ollama/discover.CudaComputeMinorMin=0\"" PATH="/usr/bin/go:$PATH"
```

but I had to do:

```
sudo ln -sf /usr/bin/gcc-10 /usr/bin/gcc
sudo ln -sf /usr/bin/g++-10 /usr/bin/g++
```

to link the proper compiler, because the CC=gcc-10 and GCC=g++-10 flags weren't working.
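As an aside, nvcc has a standard flag (-ccbin) for selecting the host compiler, which may avoid relinking the system gcc; whether ollama's make plumbing lets you inject it like this is untested, so treat the NVCC override as an assumption:

```
# -ccbin is a standard nvcc flag selecting the host C++ compiler.
# Whether the build honors an NVCC override like this is an untested assumption:
make CUDA_11_PATH=/opt/cuda NVCC="/opt/cuda/bin/nvcc -ccbin /usr/bin/gcc-10" CUDA_ARCHITECTURES="30;35;37;50;52"
```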

Well... now I have another problem: after ./ollama serve, if I try to communicate via ollama run, it goes to the chat correctly, but when I send a message it responds with:

Error: POST predict: Post "http://127.0.0.1:37661/completion": EOF

With this log from the server:

```
time=2025-01-27T14:11:16.404+01:00 level=DEBUG source=sched.go:407 msg="context for request finished"
time=2025-01-27T14:11:16.404+01:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/francesco/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff duration=5m0s
time=2025-01-27T14:11:16.404+01:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/home/francesco/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff refCount=0
time=2025-01-27T14:11:16.414+01:00 level=DEBUG source=server.go:416 msg="llama runner terminated" error="exit status 2"
```

No idea of the reason....


@joao-le commented on GitHub (Jan 27, 2025):

I have a K40c (compute capability 3.5, supported through the NVIDIA 470.xx legacy drivers) and it was very easy to install on Ubuntu 20.04 (it did not work, by contrast, on Fedora 41).
Procedure:

install CUDA Toolkit 11.4 (basically follow the procedures on their page: https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local). This also installs the drivers.

```
$ sudo apt install g++
$ sudo reboot
```

install golang (the latest version, 1.23.5, worked, but not the repository version; see https://golang.google.cn/doc/install).

clone ollama and follow the procedure on Ollama's development page (https://github.com/ollama/ollama/blob/main/docs/development.md#older-linux-cuda-nvidia):

```
$ git clone https://github.com/ollama/ollama.git
$ cd ollama
$ make -j 5 CUDA_ARCHITECTURES="35;37;50;52" EXTRA_GOLDFLAGS="\"-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3\" \"-X=github.com/ollama/ollama/discover.CudaComputeMinorMin=5\""
```

launch the ollama server and the model you want to use, e.g.:

```
$ ./ollama serve &>/dev/null
$ ./ollama run phi4
```

I hope this helps.


@wonka929 commented on GitHub (Jan 27, 2025):

Yep, maybe I figured it out.
My GPU is a K3100M.
It seems it only supports compute capability 3.0.
I managed to compile with 3.0 capability, but I always get the parsing error from the API with 3.0.
I don't know, maybe 3.0 is just far too old to work.
Could it be?

@sanchez314c commented on GitHub (Feb 17, 2025):

Can anyone clarify these steps for someone who is not a shell guru?

I'm working with a fresh install of 24.04. Blank slate. Ubuntu picks up the K80 automatically and loads the driver.

I'm not contesting that the above works, but there are clearly some missing steps or things to check for the laymen out there like me. If anyone can provide some clarity it would be greatly appreciated.


@langstonmeister commented on GitHub (Feb 17, 2025):

Do you have the CUDA driver and toolkit installed? You will need CUDA tools 11.4, but make sure not to use the included driver; stick with the version that Ubuntu loaded for you.
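A quick way to confirm both halves are in place before building (standard NVIDIA tools; the toolkit path assumes the default runfile install location):

```
# Driver side: should report a 470.xx driver and "CUDA Version: 11.4"
nvidia-smi
# Toolkit side: should report release 11.4 (default install path assumed)
/usr/local/cuda-11.4/bin/nvcc --version
```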

Something that helped me a lot when I was getting started with the CLI was setting up SSH from another computer. Then I can just copy and paste from my regular desktop. I'm running a headless server though; not sure exactly what your setup is.

The included documentation should be pretty copy-paste friendly:
development documentation for Linux: https://github.com/ollama/ollama/blob/main/docs/development.md#linux

Check out my comments on this thread, which should also be copy-paste friendly:
pt1: https://github.com/ollama/ollama/issues/1756#issuecomment-2295265742
pt2: https://github.com/ollama/ollama/issues/1756#issuecomment-2295271204

Hope that helps!


@Diego77648 commented on GitHub (Feb 22, 2025):

Just to update a bit in case anyone is trying to use an older GPU: the gpu.go file is in /discover/gpu.go. You just need to change

```
var (
    CudaComputeMajorMin = "5"
    CudaComputeMinorMin = "0"
)
```

to

```
var (
    CudaComputeMajorMin = "3"
    CudaComputeMinorMin = "0"
)
```

and then compile Ollama; then it will detect your GPU:

```
time=2025-02-22T10:41:43.208Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-02-22T10:41:48.420Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4db0c484-9ecf-65ef-9cc9-94696b5b1ddc library=cuda variant=v11 compute=3.7 driver=11.4 name="Tesla K80" total="11.2 GiB" available="11.1 GiB"
time=2025-02-22T10:41:48.420Z level=INFO source=types.go:130 msg="inference compute" id=GPU-6ac060d1-f8a6-124e-0541-e15257875097 library=cuda variant=v11 compute=3.7 driver=11.4 name="Tesla K80" total="11.2 GiB" available="11.1 GiB"
```


@wonka929 commented on GitHub (Feb 22, 2025):

Yes, true, but beware!
With compute capability 3.0 it won't compile.
The real minimum requirement is 3.5:

```
var (
    CudaComputeMajorMin = "3"
    CudaComputeMinorMin = "5"
)
```


@idream3000 commented on GitHub (Mar 3, 2025):

> Just to update a bit in case anyone is trying to use an older GPU: the gpu.go file is in /discover/gpu.go. You just need to change var ( CudaComputeMajorMin = "5" CudaComputeMinorMin = "0" )
>
> to
>
> var ( CudaComputeMajorMin = "3" CudaComputeMinorMin = "0" )
>
> and then compile Ollama; then it will detect your GPU:
>
> time=2025-02-22T10:41:43.208Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs" time=2025-02-22T10:41:48.420Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4db0c484-9ecf-65ef-9cc9-94696b5b1ddc library=cuda variant=v11 compute=3.7 driver=11.4 name="Tesla K80" total="11.2 GiB" available="11.1 GiB" time=2025-02-22T10:41:48.420Z level=INFO source=types.go:130 msg="inference compute" id=GPU-6ac060d1-f8a6-124e-0541-e15257875097 library=cuda variant=v11 compute=3.7 driver=11.4 name="Tesla K80" total="11.2 GiB" available="11.1 GiB"

nvidia-smi says "No running processes found",
ollama ps says "100% GPU",
but the truth is 100% CPU usage!
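A simple way to tell those two stories apart while a model is generating; plain shell, and the model name here is just an example:

```
# Kick off a generation, then watch GPU utilization as it runs.
ollama run llama3 "Tell me about the moon" &
watch -n 1 nvidia-smi   # GPU-Util should climb well above 0% if the GPU is really in use
```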


@idream3000 commented on GitHub (Mar 5, 2025):

Finally done!
Shared at:
https://github.com/idream3000/ollama37.git


@webclinic017 commented on GitHub (Mar 7, 2025):

@idream3000 is there a Windows build for the same?


@dogkeeper886 commented on GitHub (Apr 3, 2025):

> Finally done! Shared at: https://github.com/idream3000/ollama37.git

You're a legend! Your Git and hints are seriously helping me out. GCC-10 is a lifesaver.


@aecium commented on GitHub (Apr 9, 2025):

I have gotten @idream3000's custom repo to build and use my K80 following the steps below. I have built this a few times to test, but if you run into problems let me know.

@idream3000 absolute legend, for the record.


I did a fresh install of Ubuntu 22.04 (I unplugged my K80 during the install so it would not install nvidia drivers)

once the install was complete, I updated

$ sudo apt update
$ sudo apt upgrade

grab gcc-10, which will be needed later on for compiling ollama37
$ sudo apt install gcc-10

remove the default gcc
$ sudo rm /usr/bin/gcc

set the default gcc to gcc-10 by creating a symlink
$ sudo ln -s /usr/bin/gcc-10 /usr/bin/gcc

$ sudo apt install g++-10

set g++ to g++-10
$ sudo rm /usr/bin/g++
$ sudo ln -s /usr/bin/g++-10 /usr/bin/g++
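(As an aside, a less destructive way to get the same effect than deleting /usr/bin/gcc is the standard update-alternatives mechanism, which makes switching back trivial:

```
# Register gcc-10/g++-10 as selectable alternatives instead of replacing the symlinks.
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 100
sudo update-alternatives --config gcc   # interactively pick the active gcc
```
)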

$ sudo apt install cmake
$ sudo snap install go --classic
you want go version go1.24.2 linux/amd64; you will need to reboot before you can run go version

remove all nvidia drivers if there are any. I unplugged my K80 during the Ubuntu install, so there were none to remove.

download cuda_11.4.0_470.42.01_linux.run using the wget command below or from https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local
$ wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda_11.4.0_470.42.01_linux.run
$ sudo sh cuda_11.4.0_470.42.01_linux.run

If the installer warns you about nvidia drivers already being installed, cancel and remove/purge all nvidia drivers.
When the installer asks what you want to install, deselect everything but the CUDA toolkit.

once the CUDA toolkit is installed, install the NVIDIA 470 server driver

$ sudo apt install nvidia-driver-470-server

reboot

after the reboot, run nvidia-smi to check that it sees your K80

$ nvidia-smi
Mon Mar 17 15:52:16 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02 Driver Version: 470.256.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:03:00.0 Off | 0 |
| N/A 27C P8 27W / 149W | 4MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:04:00.0 Off | 0 |
| N/A 32C P8 26W / 149W | 4MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 936 G /usr/lib/xorg/Xorg 3MiB |
| 1 N/A N/A 936 G /usr/lib/xorg/Xorg 3MiB |
+-----------------------------------------------------------------------------+

add the cuda 11.4 nvcc temporarily to your PATH var; this will need to be re-run if you reboot or open a new terminal (see the note after this walkthrough for making it permanent)
$ export PATH=${PATH}:/usr/local/cuda-11.4/bin/
$ cd ollama37
$ cmake -B build
you want to see "-- Looking for a CUDA compiler - /usr/local/cuda-11.4/bin/nvcc" towards the end of the output; if you don't, make sure /usr/local/cuda-11.4/bin/nvcc exists and /usr/local/cuda-11.4/bin/ is in your PATH

$ cmake --build build

when that is done, run
$ go run . serve
in another terminal, run
$ go run . run llama3
if you want to monitor the K80 GPUs, you can run in a 3rd terminal
$ watch nvidia-smi
example output: you can see that GPU 1 is at 93% while llama3 is answering my question "Tell me about the moon"
Tue Apr 8 17:51:25 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02 Driver Version: 470.256.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:03:00.0 Off | 0 |
| N/A 45C P8 27W / 149W | 3485MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:04:00.0 Off | 0 |
| N/A 45C P0 148W / 149W | 6114MiB / 11441MiB | 93% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 949 G /usr/lib/xorg/Xorg 3MiB |
| 0 N/A N/A 201371 C ...092e7b06ed6b05be-d/ollama 3477MiB |
| 1 N/A N/A 949 G /usr/lib/xorg/Xorg 3MiB |
| 1 N/A N/A 204890 C ...092e7b06ed6b05be-d/ollama 6107MiB |
+-----------------------------------------------------------------------------+
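One follow-up to the PATH step above: to avoid re-exporting after every reboot or new terminal, you can persist it in your shell profile (plain bash convention, nothing ollama37-specific):

```
# Append the CUDA 11.4 bin dir to PATH for all future shells.
echo 'export PATH=${PATH}:/usr/local/cuda-11.4/bin/' >> ~/.bashrc
source ~/.bashrc   # apply it to the current shell as well
```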


@MORITZ0405 commented on GitHub (Apr 17, 2025):

is there a build for Windows?


@aecium commented on GitHub (Apr 17, 2025):

> is there a build for Windows?

@MORITZ0405 in theory, if the toolchain needed to build it (gcc-10, g++-10, CMake, and such) has versions that run on Windows, you should be able to build this in the same manner on Windows.

However, you might find it easier to follow the steps laid out above by setting up a local VM running Ubuntu 22.04, using VirtualBox or Hyper-V, on the Windows system that has the K80 in it.

Here is a link to documentation for setting up Ubuntu on VirtualBox 7: https://ubuntu.com/tutorials/how-to-run-ubuntu-desktop-on-a-virtual-machine-using-virtualbox#1-overview

And here is one for the same on Hyper-V: https://documentation.ubuntu.com/server/how-to/virtualisation/ubuntu-on-hyper-v/index.html

I just did a quick search for those and did not look through them, but they should be a good place to start.

Hope that helps


@dogkeeper886 commented on GitHub (Apr 18, 2025):

If you don't mind using a Docker image, there is this one. I personally use Linux and don't have a Windows environment.

https://hub.docker.com/r/dogkeeper886/ollama37

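For anyone trying that image, a typical invocation might look like the following. This is a sketch assuming the image keeps upstream ollama's port and volume conventions and that the NVIDIA container toolkit is set up on the host; check the Docker Hub page for the image's actual instructions.

# serve on the standard ollama port, persisting models in a named volume
$ docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama37 dogkeeper886/ollama37
# then pull and chat with a model inside the container
$ docker exec -it ollama37 ollama run llama3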

@hdnh2006 commented on GitHub (Apr 18, 2025):

> If you don't mind using a Docker image, there is this one. I personally use Linux and don't have a Windows environment.
>
> https://hub.docker.com/r/dogkeeper886/ollama37

Thank you mate! You are the best!!


@LeGuipo commented on GitHub (Apr 22, 2025):

So I just did Diego77648's trick and it works. My GTX 780 SC is definitely NOT TOO OLD ANYMORE 🥳
Though I cannot run models larger than 3B with it, text generation time is more than halved.
I'm happy to not have to waste money on another GPU just for toying with ollama.


@KeemOnGithub commented on GitHub (Aug 19, 2025):

> Finally done! share at: https://github.com/idream3000/ollama37.git

I tried building this on Windows using `go build .`, but the generated .exe file won't open. Does anyone know why?


@aecium commented on GitHub (Aug 19, 2025):

> > Finally done! share at: https://github.com/idream3000/ollama37.git
>
> I tried building this on Windows using `go build .`, but the generated .exe file won't open. Does anyone know why?

Does it run and work if you do something like this? From the root of the cloned GitHub repo, run `go run . serve`.


@idream3000 commented on GitHub (Aug 19, 2025):

Please try WSL on Windows!

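Before building inside WSL, it is worth confirming that the GPU is visible to the distro at all; on WSL2 the Windows host driver supplies the nvidia-smi binary inside the guest (a general diagnostic, not part of the original comment):

# run inside the WSL2 distro; the Windows host driver provides this binary
$ nvidia-smi

If the card is not listed there, the host driver does not expose it to WSL, and no rebuild of ollama37 will help.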

@KeemOnGithub commented on GitHub (Aug 20, 2025):

> Please try WSL on Windows!

It compiles, thank you! I can't get it to recognize my GPUs though... I think the latest drivers for the Tesla K20c don't support WSL2.


@KeemOnGithub commented on GitHub (Aug 20, 2025):

> > > Finally done! share at: https://github.com/idream3000/ollama37.git
> >
> > I tried building this on Windows using `go build .`, but the generated .exe file won't open. Does anyone know why?
>
> Does it run and work if you do something like this? From the root of the cloned GitHub repo, run `go run . serve`.

Ah yes, `go run . serve` does work. An .exe would be more convenient for me, but this is fine for testing.

Thanks!

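On the .exe question, one way to see why a built binary "won't open" is to give it an explicit name and launch it from a terminal, so any startup error is printed instead of a console window silently closing. A sketch for PowerShell, run from the repo root; the output name is illustrative, not something the repo defines:

PS> go build -o ollama37.exe .
PS> .\ollama37.exe serve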

@idream3000 commented on GitHub (Aug 20, 2025):

My mod only works for the K80 (compute capability 3.7). You can modify the code for the K20c yourself: simply add "3.5;" before "3.7;" in the modified file.
Reference: github-starred/ollama#26768