[GH-ISSUE #2187] Support GPU runners on CPUs without AVX #27010

Closed
opened 2026-04-22 03:52:39 -05:00 by GiteaMirror · 57 comments

Originally created by @jmorganca on GitHub (Jan 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2187

Originally assigned to: @dhiltgen on GitHub.

```
2024/01/25 10:13:00 gpu.go:137: INFO CUDA Compute Capability detected: 8.6
^Cuser@llm-01:~$ ollama serve
2024/01/25 10:14:17 images.go:815: INFO total blobs: 14
2024/01/25 10:14:17 images.go:822: INFO total unused blobs removed: 0
2024/01/25 10:14:17 routes.go:943: INFO Listening on 127.0.0.1:11434 (version 0.1.21)
2024/01/25 10:14:17 payload_common.go:106: INFO Extracting dynamic libraries...
2024/01/25 10:14:20 payload_common.go:145: INFO Dynamic LLM libraries [cpu cpu_avx cuda_v11 rocm_v6 rocm_v5 cpu_avx2]
2024/01/25 10:14:20 gpu.go:91: INFO Detecting GPU type
2024/01/25 10:14:20 gpu.go:210: INFO Searching for GPU management library libnvidia-ml.so
2024/01/25 10:14:20 gpu.go:256: INFO Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.29.06]
2024/01/25 10:14:20 gpu.go:96: INFO Nvidia GPU detected
2024/01/25 10:14:20 gpu.go:137: INFO CUDA Compute Capability detected: 8.6

[GIN] 2024/01/25 - 10:14:54 | 200 |     249.562µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/01/25 - 10:14:54 | 200 |     938.998µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/01/25 - 10:14:54 | 200 |     201.321µs |       127.0.0.1 | POST     "/api/show"
2024/01/25 10:14:54 gpu.go:137: INFO CUDA Compute Capability detected: 8.6
2024/01/25 10:14:54 gpu.go:137: INFO CUDA Compute Capability detected: 8.6
2024/01/25 10:14:54 cpu_common.go:18: INFO CPU does not have vector extensions
loading library /tmp/ollama1758121582/cuda_v11/libext_server.so
SIGILL: illegal instruction
PC=0x7f38ddf4248c m=15 sigcode=2
signal arrived during cgo execution
```

@dhiltgen this will be of interest to you

GiteaMirror added the buildbug labels 2026-04-22 03:52:39 -05:00

@dhiltgen commented on GitHub (Jan 25, 2024):

At present we're compiling the GPU runners with some of the matrix CPU features turned on which is the likely cause of this. I'll explore removing that and run performance tests to see if it has a negative impact.

@JadenSWang commented on GitHub (Jan 26, 2024):

It's quite exciting to see that the errors I've been over here eating glass over were asked about 20 hours earlier; guess I'm on the right path. Any ideas on when this may be resolved? I'm on Docker.

@dhiltgen commented on GitHub (Jan 26, 2024):

At the very least, we should detect this scenario, not load the library that will crash, and fall back to CPU so we remain functional.
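
A minimal sketch of what such a guard could look like, assuming Go and the `golang.org/x/sys/cpu` package (the function and variant names here are illustrative, not Ollama's actual code):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// pickRunnerVariant never hands a library compiled with AVX to a host that
// cannot execute AVX instructions: without AVX we skip the GPU runner (which
// is built with AVX enabled) and fall back to the plain "cpu" variant.
func pickRunnerVariant(gpuDetected bool) string {
	switch {
	case gpuDetected && cpu.X86.HasAVX:
		return "cuda_v11"
	case cpu.X86.HasAVX2:
		return "cpu_avx2"
	case cpu.X86.HasAVX:
		return "cpu_avx"
	default:
		return "cpu"
	}
}

func main() {
	fmt.Println("selected runner:", pickRunnerVariant(true))
}
```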

@dhiltgen commented on GitHub (Jan 26, 2024):

> It's quite exciting to see that the errors I've been over here eating glass over were asked about 20 hours earlier; guess I'm on the right path. Any ideas on when this may be resolved? I'm on Docker.

Until this is resolved, you can force CPU mode: https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#llm-libraries

@dhiltgen commented on GitHub (Jan 26, 2024):

With #2214 we'll at least fall back to CPU mode and not crash. A warning in the server log will help users understand why we didn't even try to use their GPU (if present) and are running slowly.

```
2024/01/26 19:41:40 cpu_common.go:18: INFO CPU does not have vector extensions
2024/01/26 19:41:40 gpu.go:128: WARN CPU does not have AVX or AVX2, disabling GPU support.
```

@JadenSWang commented on GitHub (Jan 27, 2024):

Wait, so if I have GPUs and get this error, does it mean that a) my GPUs are not configured properly, and b) my GPUs won't be used and the CPU will be used instead?

@JadenSWang commented on GitHub (Jan 27, 2024):

@dhiltgen I'm not sure this is resolved, I'm still getting the same error:

```
2024/01/27 05:33:07 images.go:857: INFO total blobs: 14
2024/01/27 05:33:07 images.go:864: INFO total unused blobs removed: 0
2024/01/27 05:33:07 routes.go:950: INFO Listening on [::]:11434 (version 0.1.22)
2024/01/27 05:33:07 payload_common.go:106: INFO Extracting dynamic libraries...
2024/01/27 05:33:10 payload_common.go:145: INFO Dynamic LLM libraries [cpu rocm_v5 cpu_avx2 rocm_v6 cpu_avx cuda_v11]
2024/01/27 05:33:10 gpu.go:94: INFO Detecting GPU type
2024/01/27 05:33:10 gpu.go:236: INFO Searching for GPU management library libnvidia-ml.so
2024/01/27 05:33:10 gpu.go:282: INFO Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.154.05]
2024/01/27 05:33:11 gpu.go:99: INFO Nvidia GPU detected
2024/01/27 05:33:11 gpu.go:140: INFO CUDA Compute Capability detected: 8.9
[GIN] 2024/01/27 - 05:34:16 | 200 | 30.507µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/01/27 - 05:34:16 | 200 | 431.803µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/01/27 - 05:34:16 | 200 | 325.402µs | 127.0.0.1 | POST "/api/show"
2024/01/27 05:34:16 gpu.go:140: INFO CUDA Compute Capability detected: 8.9
2024/01/27 05:34:16 gpu.go:140: INFO CUDA Compute Capability detected: 8.9
2024/01/27 05:34:16 cpu_common.go:18: INFO CPU does not have vector extensions
SIGILL: illegal instruction
PC=0x7f91f823142c m=9 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xc5 0xf9 0xef 0xc0 0x41 0x54 0x4c 0x8d 0x24 0xd5 0x0 0x0 0x0 0x0 0x55 0x53
goroutine 24 [syscall]:
runtime.cgocall(0x9b71c0, 0xc0000ae8a0)
/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc0000ae878 sp=0xc0000ae840 pc=0x409b0b
github.com/jmorganca/ollama/llm._Cfunc_dyn_init(0x7f9200000b70, 0xc00060e600, 0xc0002cd1b8)
_cgo_gotypes.go:190 +0x45 fp=0xc0000ae8a0 sp=0xc0000ae878 pc=0x7c3705
```

running: ollama/ollama:0.1.22

@JadenSWang commented on GitHub (Jan 27, 2024):

I just fixed it by enabling AVX in Proxmox, but it seems this would still crash without AVX support.

@dhiltgen commented on GitHub (Jan 27, 2024):

The fix to fall back to CPU mode when we detect no AVX support, and not even try to load the GPU library, was merged after we shipped 0.1.22, so it will show up in 0.1.23 when that ships.

@dhiltgen commented on GitHub (Jan 27, 2024):

> Wait, so if I have GPUs and get this error, does it mean that a) my GPUs are not configured properly, and b) my GPUs won't be used and the CPU will be used instead?

To clarify how this works: we compile multiple variations of the LLM native library. In particular for your scenario, we currently compile a single CUDA library, and that library is compiled with AVX extensions turned on. This helps improve performance when the entire model doesn't fit on the GPU (which is quite common for larger models) and we have to fall back to partially running on the CPU. AVX is ~400% faster than no AVX. However, this means that if we load that library on a system without AVX, it will crash when those instructions are executed by the process.

What has changed in 0.1.23 (not yet shipped) is detecting this scenario, rejecting the GPU library entirely, and falling back to pure CPU without AVX so that we remain functional, albeit much slower, instead of crashing. This will also report a warning in the server log to help users understand that there's a significant performance penalty due to the lack of AVX.

I highly recommend enabling the vector math extensions on your CPU virtualization system where possible.
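
If you're not sure whether your hypervisor actually exposes AVX to the guest, a quick way to check from inside the VM is to look at the CPU flags in /proc/cpuinfo. Here is a small, self-contained Go check, assuming a Linux guest (this is just a diagnostic, not part of Ollama):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Prints whether the guest CPU advertises AVX/AVX2. On Linux, the feature
// flags appear as space-separated words on the "flags" lines of /proc/cpuinfo.
func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("could not read /proc/cpuinfo:", err)
		return
	}
	have := map[string]bool{}
	for _, word := range strings.Fields(string(data)) {
		have[word] = true
	}
	fmt.Println("AVX exposed: ", have["avx"])
	fmt.Println("AVX2 exposed:", have["avx2"])
}
```

If both print false inside the VM but the physical CPU supports them, the virtualization layer is likely masking the flags (for example, a non-"host" CPU type in Proxmox, as discussed later in this thread).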

@Cybervet commented on GitHub (Jan 27, 2024):

So if the CPU has no AVX, it cannot use CUDA and the GPU no matter what, even after compiling from source?

@JadenSWang commented on GitHub (Jan 28, 2024):

@Cybervet yes, it seems GPU support requires the AVX instruction set; luckily, a lot of modern CPUs support it: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

@dhiltgen commented on GitHub (Jan 28, 2024):

AVX has been around for ~13 years and I'm not aware of any modern x86 CPU that doesn't support it. The intersection of 14+ year old CPUs and a similar vintage GPU that's supported by CUDA or ROCm and useful for LLM tasks seems unlikely. The more likely scenario is a virtualization/emulation system where it's masking out those features for portability, and given the massive performance hit by not using these features of the CPU, we recommend trying to enable them. We'll at least be functional in 0.1.23, just slow.

@Cybervet to answer your question about building from source, we don't currently optimize our build configuration for this scenario, but if you do have a situation that calls for this combination (CUDA support without AVX), modify the default flags we use to build llama.cpp [here](https://github.com/ollama/ollama/blob/main/llm/generate/gen_linux.sh#L52) and take a look at the CUDA section further down in that file.

@Cybervet commented on GitHub (Jan 29, 2024):

> @Cybervet to answer your question about building from source, we don't currently optimize our build configuration for this scenario, but if you do have a situation that calls for this combination (CUDA support without AVX), modify the default flags we use to build llama.cpp [here](https://github.com/ollama/ollama/blob/main/llm/generate/gen_linux.sh#L52) and take a look at the CUDA section further down in that file.

Well, I have a couple of HP Z800 workstations with dual Xeon X5680s (12c/24T) and 128 GB of RAM running Proxmox, and I am running Ollama in a Linux container. The X5680 is a 2010 CPU without AVX, so I thought I'd use the RTX 3060 12GB in the machine to speed up LLMs with CUDA. The CPU is old, but the GPU is new. So far I have not managed to compile with custom flags no matter what I tried; it works, but in CPU-only mode. Any ideas?

@dhiltgen commented on GitHub (Jan 29, 2024):

@Cybervet the one other change you'll need is to alter the gpu detection logic to bypass the fairly recent check we added to skip GPUs on non-AVX systems - https://github.com/ollama/ollama/blob/main/gpu/gpu.go#L133

@Cybervet commented on GitHub (Jan 29, 2024):

> @Cybervet the one other change you'll need is to alter the gpu detection logic to bypass the fairly recent check we added to skip GPUs on non-AVX systems - https://github.com/ollama/ollama/blob/main/gpu/gpu.go#L133

Is this the only change needed in gpu.go (it doesn't seem to work), or should we also make changes to cpu_common.go? I just want to see what the situation will be with no AVX and a capable GPU.

@dhiltgen commented on GitHub (Jan 29, 2024):

@Cybervet I believe the two changes you'll need to make are the compile flags and the gpu.go changes, but I haven't tested this scenario. You can set OLLAMA_DEBUG=1 to get more logs in your experiments to understand the flow better.

@dbzoo commented on GitHub (Feb 1, 2024):

I too ran into this problem; these changes worked for me:
https://github.com/dbzoo/ollama/commit/45eb1048496780a78ed07cf39b3ce6b62b5a72e3

@JadenSWang commented on GitHub (Feb 2, 2024):

@Cybervet my understanding is that you cannot use GPUs with Ollama if you don't have AVX support.

@dhiltgen commented on GitHub (Feb 16, 2024):

@khromov was pointing out that you can purchase fairly recent CPUs that Intel has chosen not to include AVX features in, so unfortunately there are ~modern systems out there that fall into this scenario. I'm still concerned that the performance is going to be really bad if you can't fit 100% of the model into the GPU.

I think what probably makes the most sense for this one is to refine our build scripts to make it much easier for users to build their own copy of ollama from source that disables AVX and other vector extensions for all build components.

@navr32 commented on GitHub (Apr 4, 2024):

Hello, I have the same problem with my Z800 (2× Xeon X5675, so 24 threads) with 96 GB of RAM and one RTX 3090 FE with 24 GB of VRAM.

I run llama.cpp without any issue at more than 30 tok/s on models that fit in VRAM, and I previously tested a llama.cpp build with Vulkan on an old AMD RX480, which gives me 2 tok/s.

So now I want my RTX 3090 to be used by Ollama too, but it fails: the latest Ollama git build says there is no AVX and so it cannot use the GPU. But this works with llama.cpp, so it should work with Ollama.

So I found the commit https://github.com/dbzoo/ollama/commit/45eb1048496780a78ed07cf39b3ce6b62b5a72e3, cloned the dbzoo branch, and it built fine (built on Manjaro latest stable, 6.6.19-1-MANJARO #1 SMP PREEMPT_DYNAMIC, CUDA 12.3.2-1, gcc (GCC) 13.2.1 20230801, and nvcc cuda_12.3.r12.3/compiler.33567101_0).

With the dbzoo version I get 3.67 to at most 4 tok/s with the nous-hermes2-mixtral:8x7b-dpo-q4_K_M 26 GB model and 20 threads (set with `/set parameter num_thread 20`). Prompt evaluation is much better with this version using the GPU on the old processor: a prompt eval rate of 108.09 tokens/s, versus just 2.68 tok/s without the GPU.

So could you merge the dbzoo commit into the dev branch? That would let people with old processors and new GPUs, or people whose newer Intel CPUs have AVX omitted from the instruction set, actually use their GPUs.

Many, many thanks to all for this wonderful work on Ollama. Have a nice day.

@angiopteris commented on GitHub (Apr 9, 2024):

> I just fixed it by enabling AVX in Proxmox, but it seems this would still crash without AVX support.

You need to set your CPU type to "host" mode so that AVX instructions are passed through.

@apunkt commented on GitHub (Apr 18, 2024):

You can force GPU compilation from source by editing /gpu/cpu_common.go line 20 to:

`return "avx"`

then compile with custom options:

`OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all-major -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_F16C=off -DLLAMA_FMA=off"`

It will still complain:

`level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"`

but it will run:

`level=INFO source=server.go:125 msg="offload to gpu" reallayers=33 layers=33 required="5888.5 MiB" used="5888.5 MiB" available="6454.3 MiB" kv="1024.0 MiB" fulloffload="560.0 MiB" partialoffload="585.0 MiB"`

So far I haven't noticed any negative effects...

@dhiltgen commented on GitHub (Apr 18, 2024):

@apunkt if you want to have a go at making a PR, see if you can set up a model where people can build from source and set an env var (or two) to toggle this. As you pointed out, you have to modify both the gpu variant logic and the compile flags. For the GPU logic, take a look at how we override the version at compile time [here](https://github.com/ollama/ollama/blob/main/scripts/build_linux.sh#L6) for inspiration.
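
A rough sketch of what such a toggle could look like on the GPU-variant side, using a hypothetical OLLAMA_NO_AVX_GPU environment variable (the variable name and functions are made up for illustration and are not an existing Ollama option):

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/cpu"
)

// allowGPUWithoutAVX is a hypothetical opt-in: normally a GPU runner built
// with AVX enabled is rejected on non-AVX hosts, but a user who has rebuilt
// the runners with AVX disabled could set OLLAMA_NO_AVX_GPU=1 to skip the check.
func allowGPUWithoutAVX() bool {
	return os.Getenv("OLLAMA_NO_AVX_GPU") == "1"
}

func gpuUsable() bool {
	if cpu.X86.HasAVX || allowGPUWithoutAVX() {
		return true // continue with normal GPU library discovery
	}
	fmt.Println("WARN CPU does not have AVX; disabling GPU support")
	return false
}

func main() {
	fmt.Println("GPU path enabled:", gpuUsable())
}
```

The build side would still need the matching CMake flags (for example the OLLAMA_CUSTOM_CPU_DEFS approach shown earlier in this thread) so that the compiled libraries themselves contain no AVX instructions.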

@apunkt commented on GitHub (Apr 19, 2024):

@dhiltgen did further testing.
It works for me with 1 GPU: loading bigger models that don't fit into VRAM also works, and the load is split between GPU/CPU.
It causes problems for me with 2 GPUs when the model fits into VRAM and a non-default num_ctx is used.
This then causes `cudaMalloc failed: out of memory` on the 1st device.
e.g. codellama:7b works fine on two 3050 8GB cards up to 12k num_ctx; setting it to 16k causes an unrecoverable crash when trying `llm_load_tensors: offloaded 31/33 layers to GPU`.

Any hints? Is it related to https://github.com/ollama/ollama/issues/3711?

@navr32 commented on GitHub (Apr 29, 2024):

So thanks to @apunkt for this new trick. Now I can run recent versions of Ollama on Manjaro Linux again.
Before building, as @apunkt says, force GPU compilation from source by editing /gpu/cpu_common.go line 20 to:

`return "avx"`

And force AVX off, as @dbzoo suggests, at line 54 of llm/generate/gen_linux.sh:

```
53 fi
54 COMMON_CMAKE_DEFS="-DCMAKE_POSITION_INDEPENDENT_CODE=on -DLLAMA_NATIVE=on -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off"
55 source $(dirname $0)/gen_common.sh
```

Then build with:

```
OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all-major -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_F16C=off -DLLAMA_FMA=off" go generate ./...
go build .
```

For a little test, I ran `ollama run llama3:70b-instruct-q4_K_M` (42 GB) on an even older Z800 than the one I mentioned previously (2× Xeon X5570, so 16 threads, and here only 32 GB of RAM, less than the model size) but with two RTX 3090 FE cards with 24 GB of VRAM each, so the whole model fits in the total amount of VRAM.

ollama serve:

```
{"build":2737,"commit":"46e12c4","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140180613820416","timestamp":1714383139}
{"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LAMMAFILE = 1 | ","tid":"140180613820416","timestamp":1714383139,"total_threads":16}
llama_model_loader: loaded meta data with 21 key-value pairs and 723 tensors from .ollama/models/blobs/sha256-376e96b1c2955bab3d2bad2eb1bfa00a7676667ed1812d9770f0f243779fec31 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-70B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 80
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_K:  441 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 70.55 B
llm_load_print_meta: model size       = 39.59 GiB (4.82 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-70B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    1.10 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:        CPU buffer size =   563.62 MiB
llm_load_tensors:      CUDA0 buffer size = 20038.81 MiB
llm_load_tensors:      CUDA1 buffer size = 19940.67 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   328.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   312.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.52 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   400.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   400.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    32.02 MiB
llama_new_context_with_model: graph nodes  = 2566
llama_new_context_with_model: graph splits = 3
{"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140180613820416","timestamp":1714383286}
{"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"140180613820416","timestamp":1714383286}
{"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"140180613820416","timestamp":1714383286}




`

This setup gives:


> ./ollama run llama3:70b-instruct-q4_K_M --verbose
> >>> Bonjour.
> Bonjour ! Comment allez-vous aujourd'hui ?
> 
> total duration:       917.654584ms
> load duration:        2.029048ms
> prompt eval count:    7 token(s)
> prompt eval duration: 161.504ms
> prompt eval rate:     43.34 tokens/s
> eval count:           10 token(s)
> eval duration:        624.022ms
> eval rate:            16.03 tokens/s
> >>>  write me the fibonnaci function in python.
> Here is a simple implementation of the Fibonacci sequence in Python:
> ```
> def fibonacci(n):
>     if n <=  of 1:
>         return n
>     else:
>         return fibonacci(n-1) + fibonacci(n-2)
> ```
> This function uses recursion to calculate the `n`-th Fibonacci number.
> 
> However, this implementation has an exponential time complexity due to the repeated calculations. A more efficient way to implement the
> Fibonacci sequence is using an iterative approach:
> ```
> def fibonacci(n):
>     a, b = 0, 1
>     for i in range(n):
>         a, b = b, a + b
>     return a
> ```
> This function uses a loop to calculate the `n`-th Fibonacci number in linear time complexity.
> 
> You can also use a closed-form expression known as Binet's formula to calculate the `n`-th Fibonacci number:
> ```
> def fibonacci(n):
>     phi = (1 + 5**0.5) / 2
>     return int((phi**n - (1-phi)**n) / 5**0.5)
> ```
> This function uses a mathematical formula to calculate the `n`-th Fibonacci number in constant time complexity.
> 
> Let me know if you have any questions or need further clarification!
> 
> total duration:       18.878588932s
> load duration:        1.98954ms
> prompt eval count:    31 token(s)
> prompt eval duration: 406.253ms
> prompt eval rate:     76.31 tokens/s
> eval count:           263 token(s)
> eval duration:        18.337893s
> eval rate:            14.34 tokens/s
> >>> Add some function to give ascii art generation of the fibonnaci number;
> Here's an updated implementation that includes a function to generate ASCII art for each Fibonacci number:
> ```python
> def fibonacci(n):
>     a, b = 0, 1
>     for i in range(n):
>         a, b = b, a + b
>     return a
> 
> def fibonacci_ascii_art(n):
>     fib_num = fibonacci(n)
>     num_str = str(fib_num)
>     art_width = len(num_str) * 2
>     art = ""
>     
>     # Top border
>     art += "*" * (art_width + 4) + "\n"
>     
>     # Fibonacci number
>     art += "* " + num_str + " *\n"
>     
>     # Bottom border
>     art += "*" * (art_width + 4) + "\n"
>     
>     return art
> 
> def generate_fibonacci_ascii_art(n):
>     for i in range(1, n+1):
>         print(f"Fibonacci number {i}:")
>         print(fibonacci_ascii_art(i))
>         print()
> 
> # Example usage
> generate_fibonacci_ascii_art(10)
> ```
> This code defines three functions:
> 
> * `fibonacci(n)`: calculates the `n`-th Fibonacci number using an iterative approach.
> * `fibonacci_ascii_art(n)`: generates ASCII art for the `n`-th Fibonacci number. It creates a box with the Fibonacci number inside, 
> using asterisks (`*`) to form the borders.
> * `generate_fibonacci_ascii_art(n)`: generates ASCII art for each Fibonacci number from 1 to `n`, and prints them all.
> 
> When you run this code, it will generate ASCII art for the first 10 Fibonacci numbers. You can adjust the `n` parameter in the 
> `generate_fibonacci_ascii_art(n)` function call to generate more or fewer Fibonacci numbers.
> 
> Here's an example output:
> ```
> Fibonacci number 1:
> ***********
> * 0 *
> ***********
> 
> Fibonacci number 2:
> ***********
> * 1 *
> ***********
> 
> Fibonacci number 3:
> ***********
> * 1 *
> ***********
> 
> Fibonacci number 4:
> *************
> * 2 *
> *************
> 
> Fibonacci number 5:
> **************
> * 3 *
> **************
> 
> Fibonacci number 6:
> ****************
> * 5 *
> ****************
> 
> Fibonacci number 7:
> *****************
> * 8 *
> *****************
> 
> Fibonacci number 8:
> *******************
> * 13 *
> *******************
> 
> Fibonacci number 9:
> *********************
> * 21 *
> *********************
> 
> Fibonacci number 10:
> ***********************
> * 34 *
> ***********************
> ```
> I hope this helps!
> 
> total duration:       39.770240959s
> load duration:        2.145353ms
> prompt eval count:    302 token(s)
> prompt eval duration: 1.319832s
> prompt eval rate:     228.82 tokens/s
> eval count:           537 token(s)
> eval duration:        38.221424s
> eval rate:            14.05 tokens/s
> >>> Send a message (/? for help)
> 

GPU usage when generation occurs:

![run_on_gpu_without_avx_cpu_model_llama3_70b](https://github.com/ollama/ollama/assets/1823291/cab05c3b-b5f3-473c-a021-01ffa77fc922)

So I think this is a very nice result. It makes the GPU runner a viable alternative even without an AVX-capable processor, and it would be great to have this in mainline Ollama. Thanks to all.

@galets commented on GitHub (May 2, 2024):

I was able to build and run Ollama on an older device without AVX using the above-mentioned method. Thanks!!!

@lenhardtx commented on GitHub (May 12, 2024):

It works for me.
HP DL360 G7
Proxmox 7.4-3 - Kernel 6.2.11-1-pve-relaxablermrr -> https://github.com/Aterfax/relax-intel-rmrr (PCI passthrough)
VM: Oracle Linux 9.4
CUDA 12.4
NVIDIA A4000

@andydvsn commented on GitHub (May 14, 2024):

Just to note: on a MacPro5,1 with X5690 CPUs and an AMD Radeon VII GPU running Debian Bookworm, attempting to run any model drops out with "Error: llama runner process has terminated: signal: illegal instruction (core dumped)".

This is version 0.1.37 and there's no graceful fallback to CPU only. I'm brand new to Ollama and can't do any more troubleshooting this evening, but will keep an eye on things. Just wanted to flag this X5690 / AMD combination, as I think most solutions above are Nvidia focused.

<!-- gh-comment-id:2111287908 -->

@dhiltgen commented on GitHub (May 14, 2024):

@andydvsn can you open a new issue for your scenario? We don't currently support GPUs on Intel macs (tracked via #1016) however it shouldn't crash.

<!-- gh-comment-id:2111338310 -->

@andydvsn commented on GitHub (May 14, 2024):

> @andydvsn can you open a new issue for your scenario? We don't currently support GPUs on Intel macs (tracked via #1016) however it shouldn't crash.

Apologies, I should have been clearer - this is a Mac Pro system, but it is not running macOS. This is on Debian Bookworm and the Ollama installation with the Linux instructions proceeded perfectly, including the detection of the GPU and download of AMD dependencies. It's essentially a Xeon workstation PC at this point, just in a silver box with an Apple logo on it.

<!-- gh-comment-id:2111342953 -->

@mii-key commented on GitHub (May 30, 2024):

Firstly, thank you for the exceptional product!

Special thanks to @apunkt for the valuable tip.

For Windows systems, use "-DLLAMA_AVX=OFF" and "-DLLAMA_AVX2=OFF" in $script:cmakeDefs in the llm\generate\gen_windows.ps1 script file, function build_cuda().

Regarding older CPU models:

> AVX has been around for ~13 years and I'm not aware of any modern x86 CPU that doesn't support it. The intersection of 14+
> ...

by @dhiltgen

While a 13-year argument holds merit, it's important to acknowledge that these old processors remain quite capable on servers and workstations today. Consequently, I believe many users would appreciate support for CPUs without AVX in your remarkable product.

<!-- gh-comment-id:2139525610 -->

@dhiltgen commented on GitHub (Jun 1, 2024):

PR #4517 lays the foundation so we can document how to ~easily build from source and get a local build with different vector extensions for the GPU runners. Once that's merged, this issue can be resolved with developer docs.

<!-- gh-comment-id:2143567352 -->

@hithold commented on GitHub (Jun 10, 2024):

> AVX has been around for ~13 years and I'm not aware of any modern x86 CPU that doesn't support it.

There are still many modern processors without AVX instructions; for example, the Celeron G5900 and Pentium G6600 (LGA1200) are only 3 years old.
It would be great to be able to run LLMs on the GPU on these modern systems without AVX.

<!-- gh-comment-id:2159082634 -->

@AlexDeveloperUwU commented on GitHub (Jul 14, 2024):

Hey! Any updates on this?

I'm trying to follow the instructions and the modifications that need to be made, but when I do a build, it only builds "ollama_llama_server", so I'm confused.

What should I do? Can anyone share their build?

This is the PC that I'm using:

![image](https://github.com/user-attachments/assets/8abed536-8f0b-48c5-99f9-336c8bd00b0e)

It's a very old CPU, I know, but it's a test server and I wanted to try some AI on the GPU to test its capabilities.

Thanks!

<!-- gh-comment-id:2227335978 -->

@navr32 commented on GitHub (Jul 14, 2024):

The Nvidia Quadro K620 is too old to give you interesting results with Ollama. As a minimum, you need VRAM of about the model size plus roughly 20% for decent usage. For example, llama3 8B Q4 is 4.7 GB, so you need at least about 6 GB of VRAM for it to run acceptably on this old computer. You also need to check the NVIDIA CUDA version requirements, and so on. Once you have confirmed your hardware meets the build requirements, small models such as Qwen2 0.5b or 1.5b (about 352 MB and 935 MB respectively) may run fine; try them and tell us, but verify your CUDA version first.
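To make the sizing rule of thumb above concrete, here is a rough back-of-the-envelope check (a sketch only; the ~20% figure is the heuristic from the comment, and real overhead also depends on context length and KV cache):

```sh
# Rough VRAM estimate: model file size plus ~20% overhead (the rule of thumb above).
model_gb=4.7                                   # e.g. llama3 8B Q4 is about 4.7 GB on disk
needed_gb=$(echo "$model_gb * 1.2" | bc -l)
echo "Estimated VRAM needed: ${needed_gb} GB"  # ~5.6 GB, so a 6 GB card is the practical minimum
```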

<!-- gh-comment-id:2227378005 -->

@AlexDeveloperUwU commented on GitHub (Jul 14, 2024):

I finally managed to get it working. Initially, the commands weren't executing properly, but I resolved that issue.

The Nvidia Quadro K620 works well with the Qwen 0.5b and Qwen 1.8b models. Both models generate AI responses quite quickly.

I made sure to update CUDA and the Toolkit to the latest available versions. Currently, I'm using Driver Version: 550.100 and CUDA Version: 12.4.

Given that this setup is primarily for testing purposes, I’m not concerned about running only small models. The goal is to learn and perform very basic tasks, and these smaller models are sufficient for that. So far, they've been working great.

<!-- gh-comment-id:2227468491 -->

@navr32 commented on GitHub (Jul 15, 2024):

Perhaps it would be good if you published here the commands you used, to help other Ubuntu users who want to try this setup on this kind of old hardware. Have a nice day.

<!-- gh-comment-id:2228144643 -->

@navr32 commented on GitHub (Aug 1, 2024):

For use with the latest v0.3.2, only a couple of very simple changes are needed.

Change cpu_common.go line 15 to:

return CPUCapabilityAVX

And gen_linux.sh line 54 to:

COMMON_CMAKE_DEFS="-DBUILD_SHARED_LIBS=off -DCMAKE_POSITION_INDEPENDENT_CODE=on -DGGML_NATIVE=off -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_OPENMP=off"
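For anyone following this recipe end to end, here is a rough sketch of the whole sequence. It is an illustration only: it assumes the v0.3.2 source layout (the gpu/cpu_common.go path is an assumption, since later releases moved the file), assumes the edited line originally reads return CPUCapabilityNone, and uses the go generate / go build steps mentioned later in this thread; check paths and line numbers against your checkout before running.

```sh
# Sketch: build Ollama v0.3.2 with the AVX requirement relaxed (assumptions noted above).
# Prerequisites: go, cmake, gcc and the CUDA toolkit.
git clone --branch v0.3.2 https://github.com/ollama/ollama
cd ollama

# 1. Report AVX capability even on CPUs without it, so the GPU runner is still selected
#    (the cpu_common.go change described above; sed pattern and path are illustrative, verify first).
sed -i '15s/return CPUCapabilityNone/return CPUCapabilityAVX/' gpu/cpu_common.go

# 2. Edit llm/generate/gen_linux.sh (line ~54) by hand, replacing COMMON_CMAKE_DEFS with the
#    value quoted in the comment above (all GGML_AVX*/FMA/F16C flags off).

# 3. Generate the runners and build the binary.
go generate ./...
go build .
```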

<!-- gh-comment-id:2262876198 -->

@Pesc0 commented on GitHub (Sep 26, 2024):

Please remove this artificial restriction. llama3.1 8B works perfectly fine with my setup:

  • Celeron G3900 (no AVX)
  • Proxmox
  • Arch Linux VM with 4 GB of RAM
  • GTX 1070 PCI passthrough

It would have saved me a full afternoon of trying to compile and manually install Ollama (and I'm lucky I'm not a beginner; imagine someone who is new to all of this).

<!-- gh-comment-id:2377976364 -->

@brycetryan commented on GitHub (Sep 27, 2024):

> For use with the latest v0.3.2 some very simple change have been done to :
>
> so now change cpu_common.go line 15 with :
>
> return CPUCapabilityAVX
>
> And gen_linux.sh line 54 with :
>
> COMMON_CMAKE_DEFS="-DBUILD_SHARED_LIBS=off -DCMAKE_POSITION_INDEPENDENT_CODE=on -DGGML_NATIVE=off -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_OPENMP=off"

This worked perfectly, thank you. I have a cheap Celeron from a few years ago without AVX/AVX2 but 5x 1060s. Inference now runs perfectly fine with multiple GPUs.

<!-- gh-comment-id:2378297642 -->

@digitalspaceport commented on GitHub (Oct 22, 2024):

Would love to see support for NVIDIA GPUs remain on the N3150/N3350/N3450 CPUs, which do not have AVX/AVX2 support. These are popular chips in the modern-ish SBC space, where boards can have PCIe slots that do work with bus-powered cards like the P2000; the Zimaboard/Zimablade is one decent example I ran into this on.

<!-- gh-comment-id:2428666917 -->

@shkron commented on GitHub (Nov 9, 2024):

In the meantime, is there a similar "hack" to get it to work from the source code, similar to the earlier suggestions? Since the migration to the Go runner, I am no longer able to locate "gen_linux.sh", nor its migrated analog.

cc: @navr32 , @brycetryan , @dhiltgen

<!-- gh-comment-id:2466483264 -->

@chris-hatton commented on GitHub (Nov 16, 2024):

As others have proven - it's technically unnecessary to require AVX support where only GPU inference is needed.
I hope one or more of the LocalAI maintainers would consider looking into this and unlocking the App for, probably, a small but significant bracket of users.

I'm running an HP Workstation as a home server, with dual Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. In spite of being older, it's a relatively powerful server computer and just the kind of target for GPU inference.

<!-- gh-comment-id:2480336905 -->

@tobiasgraeber commented on GitHub (Nov 17, 2024):

Subscribing. How do I get around this AVX restriction?

Flags seem to be upcoming with docs/development.md
(https://github.com/ollama/ollama/pull/7499/commits/7d686a38e90b0145132ca613d564e18d863adb62#diff-97db29a7915320e63d41d38a0440360a87055ee8ed03757aa263116dbbb4aabe). Any additional docs, or plans to auto-detect/handle this in the regular release as well? Thanks!

<!-- gh-comment-id:2480894088 -->

@SplendidAppendix commented on GitHub (Nov 24, 2024):

> > For use with the latest v0.3.2 some very simple change have been done to :
> >
> > so now change cpu_common.go line 15 with :
> >
> > return CPUCapabilityAVX
> >
> > And gen_linux.sh line 54 with :
> >
> > COMMON_CMAKE_DEFS="-DBUILD_SHARED_LIBS=off -DCMAKE_POSITION_INDEPENDENT_CODE=on -DGGML_NATIVE=off -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_OPENMP=off"
>
> This worked perfectly thankyou. I have a cheap Celeron from a few years ago without AVX/2 but 5x 1060's. Inference now runs perfectly fine with multiple gpus

For those looking to run Ollama without the AVX flags, I have been running the 0.3.2 version according to these instructions with success. Still waiting for the new merges.

<!-- gh-comment-id:2496100281 -->

@roycwalton commented on GitHub (Nov 24, 2024):

Subscribing. I am running a P2000 with dual Xeon E5530 without AVX support. I have Ollama running in a container; I can pass env variables easily enough, but I'm unsure how to compile with the above line fixes.

<!-- gh-comment-id:2496172680 -->

@osering commented on GitHub (Dec 2, 2024):

Right now version 0.4.7 is out, and the last advice on how to circumvent this bug on AVX-less CPUs is for 0.3.2, while there have been pretty substantive changes (including program structure changes, file deletions and new file additions) between the 0.3 and 0.4 series.
The advice from navr32 is therefore no longer fully usable.
There is advice for the same situation in "Running Ollama 0.3.12 on multiple GPUs without AVX/AVX2" (https://ecks90.com/post/66f5f313d5ee9), but it is also incomplete.
You can clone the respective 0.3 version, for instance the last of the 0.3 series, 0.3.14:
git clone --branch v0.3.14 https://github.com/ollama/ollama
Then take the file
https://github.com/ollama/ollama/blob/v0.3.14/llm/generate/gen_linux.sh and change line 55, replacing "-DGGML_AVX=on" with "-DGGML_AVX=off", and the second file https://github.com/ollama/ollama/blob/v0.3.14/discover/cpu_common.go, replacing "return CPUCapabilityNone" with "return CPUCapabilityAVX" in line 20.
I installed cmake and golang-go.
The problem is that in the 0.3 versions there is no Makefile, so the make command is of no use. It is also not clear what the right go generate / go build commands are to compile. I used go generate ./... and go build -tags cuda12, but no working ollama executable was produced (it gave errors related to ggml after being copied to /usr/local/bin). Are there some env variables that should be used?

Could somebody explain (while there is no respective flag implemented) what the right procedure is to compile this modified code on Linux (Ubuntu 24.10, amd64, nvidia-560 driver, CUDA 12.6) and install it? What should be done with the files libggml.so.gz, libllama.so.gz and ollama_llama_server.gz created in the folders ~/ollama/build/linux/amd64/cpu|cpu_avx|cpu_avx2? What are the next steps?
Thanks in advance!
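For clarity, the two v0.3.14 edits described in the comment above can be written out as commands. This is a sketch only, using the file paths and line numbers quoted in the comment; the sed patterns are illustrative and should be checked against the actual files before running.

```sh
# Sketch: the two v0.3.14 modifications described in the comment above.
git clone --branch v0.3.14 https://github.com/ollama/ollama
cd ollama

# llm/generate/gen_linux.sh, line 55: build the GPU runners without AVX.
sed -i '55s/-DGGML_AVX=on/-DGGML_AVX=off/' llm/generate/gen_linux.sh

# discover/cpu_common.go, line 20: report AVX so the CUDA runner is still selected.
sed -i '20s/return CPUCapabilityNone/return CPUCapabilityAVX/' discover/cpu_common.go
```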

<!-- gh-comment-id:2513119979 -->

@navr32 commented on GitHub (Dec 7, 2024):

Good News for Non-AVX Processors!

Hi all! We've got very good news for everyone waiting to run Ollama on their non-AVX processors with big "GPUs"!

I'm not sure how many of you have seen the update by @dhiltgen, so I am posting it here too for those who have not.
With this change, you can now run Ollama without AVX in a few easy steps, with no hacks in the code.

To learn more and get started, head over to:
https://github.com/ollama/ollama/issues/7622#issuecomment-2524637378

Have a nice day and happy running!

<!-- gh-comment-id:2524678165 -->

@osering commented on GitHub (Dec 8, 2024):

The AVX-less CPU build runs now (and most probably can be upstreamed)! Great, but the next problem has arisen.

It's great that an AVX-less Celeron CPU is finally deployable. There are just some issues with installation afterwards, getting the right things into the right folders with the right permissions, since there is no pre-made script for a locally compiled Ollama version (I had to change the user/group from ollama to root in ollama.service).
But there is still a problem (not sure if it is AVX related).
When running without any env variables set, it just identifies the CPU as the inference engine and still no cuda_v12, although the GPUs are identified as capable.
When cuda_v12 is forced, another problem arises: although there is plenty (9 x 5.9 GB) of available VRAM, it says "gpu has too little memory to allocate any layers" and "insufficient VRAM to load any model layers", and it still falls back to the CPU (and memory overflows, as there is not enough RAM).
Here are the logs (Ubuntu 24.10 x86-64, 2-core Celeron with 4 GB RAM, testing with the llama3.2-vision model):

$ journalctl -u ollama -f --no-pager
dec 08 18:05:34 rig2 ollama[19103]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.560.35.03
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuInit - 0x7e99c4c5d7f0
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuDriverGetVersion - 0x7e99c4c5d810
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuDeviceGetCount - 0x7e99c4c5d850
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuDeviceGet - 0x7e99c4c5d830
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuDeviceGetAttribute - 0x7e99c4c5d930
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuDeviceGetUuid - 0x7e99c4c5d890
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuDeviceGetName - 0x7e99c4c5d870
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuCtxCreate_v3 - 0x7e99c4c68060
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuMemGetInfo_v2 - 0x7e99c4c73520
dec 08 18:05:34 rig2 ollama[19103]: dlsym: cuCtxDestroy - 0x7e99c4cce380
dec 08 18:05:34 rig2 ollama[19103]: calling cuInit
dec 08 18:05:34 rig2 ollama[19103]: calling cuDriverGetVersion
dec 08 18:05:34 rig2 ollama[19103]: raw version 0x2f1c
dec 08 18:05:34 rig2 ollama[19103]: CUDA driver version: 12.6
dec 08 18:05:34 rig2 ollama[19103]: calling cuDeviceGetCount
dec 08 18:05:34 rig2 ollama[19103]: device count 9
dec 08 18:05:34 rig2 ollama[19103]: time=2024-12-08T18:05:34.402+02:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-2e0afdf4-156c-0c78-2dc9-664178fafa81 name="NVIDIA P106-090" overhead="0 B" before.total="5.9 GiB" before.free="5.9 GiB" now.total="5.9 GiB" now.free="5.9 GiB" now.used="46.8 MiB"
dec 08 18:05:34 rig2 ollama[19103]: time=2024-12-08T18:05:34.660+02:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-5c60883b-8d5e-aa98-5f3f-5b2f0a6ed261 name="NVIDIA P106-090" overhead="0 B" before.total="5.9 GiB" before.free="5.9 GiB" now.total="5.9 GiB" now.free="5.9 GiB" now.used="46.8 MiB"
dec 08 18:05:34 rig2 ollama[19103]: time=2024-12-08T18:05:34.903+02:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-7aed2640-3989-5e38-0bc1-dbc2d525942e name="NVIDIA P106-090" overhead="0 B" before.total="5.9 GiB" before.free="5.9 GiB" now.total="5.9 GiB" now.free="5.9 GiB" now.used="46.8 MiB"
dec 08 18:05:35 rig2 ollama[19103]: time=2024-12-08T18:05:35.147+02:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-c7578182-82ea-ce62-c9f7-6ca988b7fcb9 name="NVIDIA P106-090" overhead="0 B" before.total="5.9 GiB" before.free="5.9 GiB" now.total="5.9 GiB" now.free="5.9 GiB" now.used="46.8 MiB"
dec 08 18:05:35 rig2 ollama[19103]: time=2024-12-08T18:05:35.394+02:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-25cabcf6-5d8a-1d2a-7df6-7d98094dc567 name="NVIDIA P106-090" overhead="0 B" before.total="5.9 GiB" before.free="5.9 GiB" now.total="5.9 GiB" now.free="5.9 GiB" now.used="46.8 MiB"
dec 08 18:05:35 rig2 ollama[19103]: time=2024-12-08T18:05:35.644+02:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-d3fe759b-d280-173c-6a77-d1f00274ceaa name="NVIDIA P106-090" overhead="0 B" before.total="5.9 GiB" before.free="5.9 GiB" now.total="5.9 GiB" now.free="5.9 GiB" now.used="46.8 MiB"
dec 08 18:05:35 rig2 ollama[19103]: time=2024-12-08T18:05:35.888+02:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-733fbbf1-e0c3-5e16-66e7-f3a3e25ae8df name="NVIDIA P106-090" overhead="0 B" before.total="5.9 GiB" before.free="5.9 GiB" now.total="5.9 GiB" now.free="5.9 GiB" now.used="46.8 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.132+02:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-d04b0338-309d-b9c8-c713-738b298f37e9 name="NVIDIA P106-090" overhead="0 B" before.total="5.9 GiB" before.free="5.9 GiB" now.total="5.9 GiB" now.free="5.9 GiB" now.used="46.8 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.390+02:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-3f9559e8-f031-157a-5c51-b6b883621fc4 name="NVIDIA P106-090" overhead="0 B" before.total="5.9 GiB" before.free="5.9 GiB" now.total="5.9 GiB" now.free="5.9 GiB" now.used="46.8 MiB"
dec 08 18:05:36 rig2 ollama[19103]: releasing cuda driver library
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.390+02:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x56aeea6ca9c0 gpu_count=9
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.465+02:00 level=DEBUG source=sched.go:224 msg="loading first model" model=/root/.ollama/models/blobs/sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.465+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.469+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-2e0afdf4-156c-0c78-2dc9-664178fafa81 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="258.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.470+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.470+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.476+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-5c60883b-8d5e-aa98-5f3f-5b2f0a6ed261 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="258.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.476+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.476+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.481+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-7aed2640-3989-5e38-0bc1-dbc2d525942e library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="258.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.482+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.482+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.486+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-c7578182-82ea-ce62-c9f7-6ca988b7fcb9 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="258.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.486+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.486+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.491+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-25cabcf6-5d8a-1d2a-7df6-7d98094dc567 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="258.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.491+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.491+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.496+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-d3fe759b-d280-173c-6a77-d1f00274ceaa library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="258.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.496+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.496+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.503+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-733fbbf1-e0c3-5e16-66e7-f3a3e25ae8df library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="258.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.503+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.503+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.507+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-d04b0338-309d-b9c8-c713-738b298f37e9 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="258.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.508+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.508+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.512+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-3f9559e8-f031-157a-5c51-b6b883621fc4 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="258.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.513+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.513+02:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=9 available="[5.9 GiB 5.9 GiB 5.9 GiB 5.9 GiB 5.9 GiB 5.9 GiB 5.9 GiB 5.9 GiB 5.9 GiB]"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.517+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-2e0afdf4-156c-0c78-2dc9-664178fafa81 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="669.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.517+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-5c60883b-8d5e-aa98-5f3f-5b2f0a6ed261 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="669.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.517+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-7aed2640-3989-5e38-0bc1-dbc2d525942e library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="669.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.517+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-c7578182-82ea-ce62-c9f7-6ca988b7fcb9 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="669.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.517+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-25cabcf6-5d8a-1d2a-7df6-7d98094dc567 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="669.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.517+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-d3fe759b-d280-173c-6a77-d1f00274ceaa library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="669.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.517+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-733fbbf1-e0c3-5e16-66e7-f3a3e25ae8df library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="669.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.517+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-d04b0338-309d-b9c8-c713-738b298f37e9 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="669.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.517+02:00 level=DEBUG source=memory.go:186 msg="gpu has too little memory to allocate any layers" id=GPU-3f9559e8-f031-157a-5c51-b6b883621fc4 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB" minimum_memory=479199232 layer_size="148.9 MiB" gpu_zer_overhead="4.6 GiB" partial_offload="669.5 MiB" full_offload="669.5 MiB"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.518+02:00 level=DEBUG source=memory.go:330 msg="insufficient VRAM to load any model layers"
dec 08 18:05:36 rig2 ollama[19103]: time=2024-12-08T18:05:36.518+02:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="3.7 GiB" before.free="2.8 GiB" before.free_swap="296.3 MiB" now.total="3.7 GiB" now.free="2.7 GiB" now.free_swap="296.3 MiB"

and

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA P106-090                Off |   00000000:03:00.0 Off |                  N/A |
|  0%   15C    P8              5W /   75W |       7MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA P106-090                Off |   00000000:04:00.0 Off |                  N/A |
|  0%   16C    P8              5W /   75W |       7MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA P106-090                Off |   00000000:05:00.0 Off |                  N/A |
|  0%   15C    P8              5W /   75W |       7MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA P106-090                Off |   00000000:06:00.0 Off |                  N/A |
|  0%   16C    P8              5W /   75W |       7MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA P106-090                Off |   00000000:07:00.0 Off |                  N/A |
|  0%   14C    P8              5W /   75W |       7MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA P106-090                Off |   00000000:0B:00.0 Off |                  N/A |
|  0%   17C    P8              5W /   75W |       7MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA P106-090                Off |   00000000:0C:00.0 Off |                  N/A |
|  0%   16C    P8              5W /   75W |       7MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA P106-090                Off |   00000000:0D:00.0 Off |                  N/A |
|  0%   15C    P8              5W /   75W |       7MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   8  NVIDIA P106-090                Off |   00000000:0E:00.0 Off |                  N/A |
|  0%   16C    P8              5W /   75W |       7MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1179      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A      1179      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A      1179      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A      1179      G   /usr/lib/xorg/Xorg                              4MiB |
|    4   N/A  N/A      1179      G   /usr/lib/xorg/Xorg                              4MiB |
|    5   N/A  N/A      1179      G   /usr/lib/xorg/Xorg                              4MiB |
|    6   N/A  N/A      1179      G   /usr/lib/xorg/Xorg                              4MiB |
|    7   N/A  N/A      1179      G   /usr/lib/xorg/Xorg                              4MiB |
|    8   N/A  N/A      1179      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

What am I doing wrong?
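One thing that stands out when reading the figures the scheduler prints above is the size of the fixed per-GPU costs. The following is just arithmetic on the logged values; the exact accounting inside memory.go may differ.

```sh
# Per-GPU figures from the DEBUG lines above, added up:
#   minimum_memory    479199232 bytes  ≈ 0.45 GiB
#   gpu_zer_overhead                   ≈ 4.6  GiB
#   full_offload (multi-GPU pass)      ≈ 0.65 GiB (669.5 MiB)
# That fixed cost alone consumes most of each 5.9 GiB card, leaving very little
# room for 148.9 MiB layers, which is consistent with the "too little memory" messages.
echo "479199232/(1024^3) + 4.6 + 0.65" | bc -l   # ≈ 5.7 (GiB)
```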

<!-- gh-comment-id:2526342236 -->

@navr32 commented on GitHub (Dec 10, 2024):

@osering please give more details! What model size are you loading? Command line or API? Do you have success with other models?


@shkron commented on GitHub (Dec 10, 2024):

A follow-up question to the latest messages. How much VRAM does llama3.2-vision 11b (7.9 GB) really need? I have an 11 GB GPU, and apparently it doesn't fit with this build/patch and defaults to the CPU runner. Originally I thought it was just a bit too big to fit, but now we have a similar experience. I wonder whether there is a pattern with this particular model.

I am able to fit this one with no problem:

qwen:14b-text-v1.5-q3_K_M
7.4GB
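
For reference, one way to see what the scheduler actually decided for a loaded model (its reported size and the CPU/GPU split), and to get the layer-by-layer memory estimates in the server log, is the standard CLI plus debug logging. A minimal sketch, not specific to this build or patch:

```
# show loaded models with their size and CPU/GPU split
ollama ps

# restart the server with debug logging to see the per-layer memory estimates
OLLAMA_DEBUG=1 ollama serve
```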

@osering commented on GitHub (Dec 12, 2024):

@osering please give more details! What model size are you loading? Command line or API? Do you have success with other models?
Hi navr32. Thanks for inquiring!

  1. Model used: llama3.2-vision 11b (7.9 GB), which gives the message: Error: model requires more system memory (6.2 GiB) than is available (3.1 GiB)

  2. The problem is that it cannot load the model: it refuses to use VRAM, and the RAM is too small (4 GB total). Basically I run the CLI: ollama run llama3.2-vision
    and also tried: curl http://localhost:11434/api/generate -d '{"model":"llama3.2-vision","options":{"num_gpu":40},"prompt":"hello","stream":false}' and received: {"error":{"message":"model requires more system memory (6.2 GiB) than is available (3.1 GiB)","type":"api_error","param":null,"code":null}}
    so no difference. CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 ollama serve did not help either.

  3. No models load into VRAM (I also tried llama3.1:8b and it gave a different message, although dmesg showed the same thing: Error: llama runner process has terminated: error loading model: unable to allocate backend buffer
    llama_load_model_from_file: failed to load model).
    Only the small ones (ollama run llama3.2:1b as well as :3b) run peacefully on this AVX-less Celeron 3865U CPU.
    When loading the bigger models, nvtop shows for a second or two that one or some of the GPUs switch from Graphic to Compute mode, occupied VRAM jumps by about 60 MB, and then drops back.

I'm new to this and don't know which way to head to find the source of the problem.

BTW dmesg showed errors:

[ 9575.935986] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-3.scope,task=nvidia-smi,pid=7908,uid=1000
[ 9575.936011] Out of memory: Killed process 7908 (nvidia-smi) total-vm:4215784kB, anon-rss:2324660kB, file-rss:20kB, shmem-rss:0kB, UID:1000 pgtables:4620kB oom_score_adj:0                                                                                                                           
[ 9797.291534] traps: ld.so[8064] trap invalid opcode ip:7a922d117d34 sp:7ffe7fc69a68 error:0 in llama-cpp-cuda[e8d34,7a922d0f3000+133c000]
[ 9837.126515] traps: ld.so[8091] trap invalid opcode ip:7a4010835a24 sp:7ffe34fb9be8 error:0 in llama-cpp-fallback[e5a24,7a4010812000+1122000]
[ 9877.298047] traps: llama-ggml[8117] trap divide error ip:7f9dfe36abb9 sp:7fffe42b8b20 error:0 in libc.so.6[175bb9,7f9dfe21d000+195000]
[ 9917.198826] traps: ld.so[8143] trap invalid opcode ip:738420f57a24 sp:7ffd05918358 error:0 in llama-cpp-fallback[e5a24,738420f34000+1122000]
[ 9957.384054] traps: piper[8177] trap divide error ip:74800c4c5bb9 sp:7fff152f63d0 error:0 in libc.so.6[175bb9,74800c378000+195000]
[ 9997.434457] traps: rwkv[8208] trap divide error ip:7676a8910bb9 sp:7fffd8b71ea0 error:0 in libc.so.6[175bb9,7676a87c3000+195000]
[10037.424512] traps: whisper[8237] trap divide error ip:71b8b14abbb9 sp:7ffcc1cfeba0 error:0 in libc.so.6[175bb9,71b8b135e000+195000]
[10077.463686] traps: huggingface[8265] trap divide error ip:713e60ebfbb9 sp:7ffc8c50f5c0 error:0 in libc.so.6[175bb9,713e60d72000+195000]
[10117.538563] traps: bert-embeddings[8315] trap divide error ip:7abbfd9bbbb9 sp:7ffc890c0e00 error:0 in libc.so.6[175bb9,7abbfd86e000+195000]
[13081.069733] perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[13212.173267] traps: ollama_llama_se[11163] trap invalid opcode ip:74527d0cdcb4 sp:7ffc47436198 error:0 in libggml.so[10cb4,74527d0cb000+9e000]
[13616.379262] __vm_enough_memory: pid: 11592, comm: ollama_llama_se, bytes: 4912910336 not enough memory for the allocation
[13616.379274] __vm_enough_memory: pid: 11592, comm: ollama_llama_se, bytes: 4912934912 not enough memory for the allocation
[13616.379281] __vm_enough_memory: pid: 11592, comm: ollama_llama_se, bytes: 4913041408 not enough memory for the allocation
[13616.379338] __vm_enough_memory: pid: 11592, comm: ollama_llama_se, bytes: 4912910336 not enough memory for the allocation
[13772.697257] __vm_enough_memory: pid: 11833, comm: ollama_llama_se, bytes: 4912910336 not enough memory for the allocation
[13772.697269] __vm_enough_memory: pid: 11833, comm: ollama_llama_se, bytes: 4912934912 not enough memory for the allocation
[13772.697275] __vm_enough_memory: pid: 11833, comm: ollama_llama_se, bytes: 4913041408 not enough memory for the allocation
[13772.697325] __vm_enough_memory: pid: 11833, comm: ollama_llama_se, bytes: 4912910336 not enough memory for the allocation
[13799.229263] __vm_enough_memory: pid: 11990, comm: ollama_llama_se, bytes: 4912910336 not enough memory for the allocation
[13799.229275] __vm_enough_memory: pid: 11990, comm: ollama_llama_se, bytes: 4912934912 not enough memory for the allocation
[13799.229282] __vm_enough_memory: pid: 11990, comm: ollama_llama_se, bytes: 4913041408 not enough memory for the allocation
[13799.229339] __vm_enough_memory: pid: 11990, comm: ollama_llama_se, bytes: 4912910336 not enough memory for the allocation
[13861.607631] traps: ollama_llama_se[12214] trap invalid opcode ip:7b4839c82cb4 sp:7ffd891f7758 error:0 in libggml.so[10cb4,7b4839c80000+9e000]

Can these trap errors be the source of the problem? Why do they appear on a freshly installed Ubuntu 24.10 with a freshly compiled Ollama? After a reboot they were gone. Should I reinstall?

And here is the ollama serve output (no errors, but msg="Dynamic LLM libraries" runners=[cpu] is alarming):

$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 ollama serve
2024/12/08 14:03:57 routes.go:1194: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1,2,3,4,5,6,7,8 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/lemuria/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-12-08T14:03:57.951+02:00 level=INFO source=images.go:753 msg="total blobs: 6"
time=2024-12-08T14:03:57.951+02:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-12-08T14:03:57.951+02:00 level=INFO source=routes.go:1245 msg="Listening on 127.0.0.1:11434 (version 0.5.0-rc1-6-g1841309)"
time=2024-12-08T14:03:57.952+02:00 level=INFO source=routes.go:1274 **msg="Dynamic LLM libraries" runners=[cpu]**
time=2024-12-08T14:03:57.952+02:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2024-12-08T14:04:00.304+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-2e0afdf4-156c-0c78-2dc9-664178fafa81 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB"
time=2024-12-08T14:04:00.304+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-5c60883b-8d5e-aa98-5f3f-5b2f0a6ed261 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB"
time=2024-12-08T14:04:00.304+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-7aed2640-3989-5e38-0bc1-dbc2d525942e library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB"
time=2024-12-08T14:04:00.304+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-c7578182-82ea-ce62-c9f7-6ca988b7fcb9 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB"
time=2024-12-08T14:04:00.304+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-25cabcf6-5d8a-1d2a-7df6-7d98094dc567 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB"
time=2024-12-08T14:04:00.304+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-d3fe759b-d280-173c-6a77-d1f00274ceaa library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB"
time=2024-12-08T14:04:00.304+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-733fbbf1-e0c3-5e16-66e7-f3a3e25ae8df library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB"
time=2024-12-08T14:04:00.304+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-d04b0338-309d-b9c8-c713-738b298f37e9 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB"
time=2024-12-08T14:04:00.304+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-3f9559e8-f031-157a-5c51-b6b883621fc4 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA P106-090" total="5.9 GiB" available="5.9 GiB"
[GIN] 2024/12/08 - 14:04:42 | 200 |      65.463µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/12/08 - 14:04:42 | 200 |   43.088629ms |       127.0.0.1 | POST     "/api/show"
time=2024-12-08T14:04:42.313+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2024-12-08T14:04:46.402+02:00 level=INFO source=server.go:104 msg="system memory" total="3.7 GiB" free="2.8 GiB" free_swap="303.7 MiB"
time=2024-12-08T14:04:46.407+02:00 level=WARN source=server.go:136 msg="model request too large for system" requested="6.2 GiB" available=3294216192 total="3.7 GiB" free="2.8 GiB" swap="303.7 MiB"
time=2024-12-08T14:04:46.407+02:00 level=INFO source=sched.go:428 msg="NewLlamaServer failed" model=/home/lemuria/.ollama/models/blobs/sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 error="model requires more system memory (6.2 GiB) than is available (3.1 GiB)"

Thanks again...


@navr32 commented on GitHub (Dec 18, 2024):

@osering Have you managed to stabilize your system? Before trying anything you must have a system that runs without any traps across the running processes... here I see too many, and these are not Ollama problems... piper, whisper, huggingface.

traps: piper[8177] trap divide error ip:74800c4c5bb9 sp:7fff152f63d0 error:0 in libc.so.6[1
whisper[8237] trap divide error ip:71b8b14abbb9 sp:7ffcc1cfeba0 error:0 in libc.so.6
traps: huggingface[8265] trap divide error ip:713e60ebfbb9 sp:7ffc8c50f5c0 error:0 in libc.so.6[175bb9,713e60d72000+195000]

Do you run in a VM? Have you given the VM enough memory? Check your RAM first; if that is not the problem, look at power supply stability, processor overheating, a bad kernel version, and so on. Good luck.
If after all of this it still fails, I think you should open a new issue!


@kaplanski commented on GitHub (Dec 19, 2024):

Concerning this topic, let me just chime in with a workaround.

I tried my best to hack around the build scripts and such, but the compiled binary would always end up crashing upon loading a model (Illegal instruction, it tried doing AVX no matter what). After digging deep the workaround I found was to use Intel's Software Development Emulator (SDE) to emulate AVX. What you'd do is to download SDE, then install a stock copy of Ollama and modify the systemd ollama.service.

Change
ExecStart=/usr/local/bin/ollama serve
To
ExecStart=/path/to/intel/sde64 -hsw -- /usr/local/bin/ollama serve

This emulates a Haswell processor, including full AVX support. Loading a model takes a bit longer than normal due to the emulation overhead, but once the model has fully loaded and the first message has been sent, it works and performs as expected.

According to the System Configuration section, yama needs to be disabled on Linux for SDE to work.
echo 0 > /proc/sys/kernel/yama/ptrace_scope
(Make this a service as well and make ollama.service run after it.)
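
A minimal sketch of what that setup could look like, assuming SDE is unpacked under /opt/sde (that path, and persisting the yama setting via sysctl rather than a dedicated service, are assumptions, not part of the original workaround). First, the ptrace setting:

```
# /etc/sysctl.d/10-ptrace.conf -- persist the yama setting SDE needs
kernel.yama.ptrace_scope = 0
```

Then a drop-in created with systemctl edit ollama.service; the empty ExecStart= clears the stock entry before overriding it:

```
[Service]
ExecStart=
ExecStart=/opt/sde/sde64 -hsw -- /usr/local/bin/ollama serve
```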

I tried this on an Intel Xeon X5550 (Nehalem EP/Gainestown) from 2009 w/ an Nvidia Tesla P4.


@osering commented on GitHub (Dec 22, 2024):

@osering Have you managed to stabilize your system? Before trying anything you must have a system that runs without any traps across the running processes... here I see too many, and these are not Ollama problems... piper, whisper, huggingface.

traps: piper[8177] trap divide error ip:74800c4c5bb9 sp:7fff152f63d0 error:0 in libc.so.6[1
whisper[8237] trap divide error ip:71b8b14abbb9 sp:7ffcc1cfeba0 error:0 in libc.so.6
traps: huggingface[8265] trap divide error ip:713e60ebfbb9 sp:7ffc8c50f5c0 error:0 in libc.so.6[175bb9,713e60d72000+195000]

As I wrote, these traps appeared only on the first run. Afterwards there were no traps in dmesg.
Two types of error messages:
for ollama run llama3.2-vision
Error: model requires more system memory (6.2 GiB) than is available (3.0 GiB)
for ollama run llama3.1:8b
Error: llama runner process has terminated: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model

Do you run in a VM? Have you given the VM enough memory? Check your RAM first; if that is not the problem, look at power supply stability, processor overheating, a bad kernel version, and so on. Good luck. If after all of this it still fails, I think you should open a new issue!

No VMs. Just a clean, minimal Lubuntu 24.10 on an AVX-less Celeron 3865U (built-in iGPU) with 4 GB of RAM and 9 x NVIDIA P106-090 GPUs (5.9 GiB each).
Linux rig2 6.11.0-13-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Sat Nov 30 23:51:51 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
NVIDIA proprietary driver 560.35.03, CUDA version 12.6. Freshly built Ollama (the not-yet-upstreamed AVX-less version).
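
For anyone double-checking a similar setup, whether the CPU exposes any AVX-family instructions at all can be read straight from /proc/cpuinfo. A small sketch; the expectation of empty output on a Celeron 3865U is an assumption based on that CPU lacking AVX:

```
# list any AVX-family flags the kernel reports for this CPU; no output means no AVX,
# so a runner built with AVX/AVX2 will die with an illegal-instruction trap
grep -o -w -e avx -e avx2 -e avx512f /proc/cpuinfo | sort -u
```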

Probably I will have to try llama.cpp and/or koboldcpp...

P.S. I also researched Intel's Software Development Emulator (SDE) to emulate AVX, but read that it adds significant overhead, so I did not try it out:
https://www.intel.com/content/www/us/en/developer/articles/license/pre-release-license-agreement-for-software-development-emulator.html
https://www.intel.com/content/www/us/en/download/684897/intel-software-development-emulator.html
https://downloadmirror.intel.com/843185/sde-external-9.48.0-2024-11-25-lin.tar.xz
After the revelations by kaplanski, I may have to give it a try as well.
Although my slow PC took something like 8 hours to compile the AVX-less Ollama, afterwards there were no errors such as crashing with Illegal instruction or attempts to use the AVX version.

P.P.S. Is there any info on when this AVX-less version by @dhiltgen is planned to be upstreamed?

P.P.P.S. If useful for somebody, I can share the compiled ollama, ollama_llama_server and libggml_cuda_v12.so.


@akesterson commented on GitHub (Jan 8, 2025):

@kaplanski wrote

Concerning this topic, let me just chime in with a workaround.

I tried my best to hack around the build scripts and such, but the compiled binary would always end up crashing upon loading a model (Illegal instruction, it tried doing AVX no matter what). After digging deep the workaround I found was to use Intel's Software Development Emulator (SDE) to emulate AVX. What you'd do is to download SDE, then install a stock copy of Ollama and modify the systemd ollama.service.

Change ExecStart=/usr/local/bin/ollama serve To ExecStart=/path/to/intel/sde64 -hsw -- /usr/local/bin/ollama serve

... snip ...

Brilliant. This works on an Intel Xeon L5520 from 2009 as well. The models do indeed take a while to load, but performance is quite good once the model is loaded and starts responding.

Reference: github-starred/ollama#27010