[GH-ISSUE #4139] only 1 GPU found -- regression 1.32 -> 1.33 #64610

Closed
opened 2026-05-03 18:19:40 -05:00 by GiteaMirror · 25 comments
Owner

Originally created by @AlexLJordan on GitHub (May 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4139

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Hi everyone,

Sorry, I don't have much time to write, but going from 1.32 to 1.33, this:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
  Device 1: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
  Device 2: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.45 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:      CUDA0 buffer size =  1194.53 MiB
llm_load_tensors:      CUDA1 buffer size =  1194.53 MiB
llm_load_tensors:      CUDA2 buffer size =  1188.49 MiB

changed into this:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 3 repeating layers to GPU
llm_load_tensors: offloaded 3/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3647.87 MiB
llm_load_tensors:      CUDA0 buffer size =   325.78 MiB

1.33 hammers my CPU cores, is generally slower, and doesn't even properly utilize the one GPU it does find.

I need the new concurrency features, so I'd really appreciate it if 1.33 worked on my machine.

Please help.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

1.33

GiteaMirror added the bug, nvidia labels 2026-05-03 18:19:40 -05:00
Author
Owner

@dhiltgen commented on GitHub (May 3, 2024):

Can you share more of the server log, ideally with OLLAMA_DEBUG=1 set, so we can see the early bootstrapping GPU discovery logic?

Author
Owner

@AlexLJordan commented on GitHub (May 3, 2024):

These are logs that I store automatically, so they don't have OLLAMA_DEBUG set. It's late here, so if these logs aren't helpful, I'll need to rerun with DEBUG tomorrow.

Ollama 1.32
ollama-1.32.log

Ollama 1.33
ollama-1.33.log

Thanks for your help!

Author
Owner

@dhiltgen commented on GitHub (May 3, 2024):

From the logs I can see that we did discover all 3 GPUs:

time=2024-05-03T22:22:34.769+02:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama2048242415/runners/cuda_v11/libcudart.so.11.0 count=3

Unfortunately, without debug set, I can't see why the scheduler decided to run on only a single GPU with only 3 layers. If you can re-run just 0.1.33 with OLLAMA_DEBUG=1 and share the log, that will help root-cause the defect.
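
A minimal sketch of one way to do this re-run, assuming the ollama binary is on PATH and isn't being managed by systemd. The wrapper below is purely illustrative and is equivalent to running OLLAMA_DEBUG=1 ollama serve in a shell:

// Illustrative wrapper only: start `ollama serve` with OLLAMA_DEBUG=1 so the
// GPU discovery and scheduler decisions are logged.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("ollama", "serve")
	cmd.Env = append(os.Environ(), "OLLAMA_DEBUG=1")
	cmd.Stdout = os.Stdout // the server log, including GPU discovery, ends up here
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}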

Author
Owner

@bsdnet commented on GitHub (May 4, 2024):

@dhiltgen it seems the log you referred to is from ollama-1.32.log

Author
Owner

@AlexLJordan commented on GitHub (May 4, 2024):

Hi again!

I was able to rerun the workload with DEBUG enabled on both versions [see below].

Weirdly enough, Ollama 1.33 uses a full GPU this time:
[Screenshot: Selection_2024-05-04--001]

It's still much slower than 1.32: one set of jobs completes in half an hour on 1.32, while 1.33 shows a projected completion time somewhere between 2 and 3.5 hours.

<EDIT>
Additional weirdness:
Yesterday's run of 1.33 didn't have that msg="detected GPUs" library=/tmp/ollama2048242415/runners/cuda_v11/libcudart.so.11.0 count=3 line, as bsdnet already pointed out.
But in the attached log the following line showed up:

time=2024-05-04T14:05:09.283+02:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama2604474549/runners/cuda_v11/libcudart.so.11.0 count=3

</EDIT>

I'm relatively sure that the only change from yesterday to today is adding the following to the environment (.env) file where Ollama runs. (Turns out I had the Ollama EnvVars in the wrong file, and NUM_PARALLEL as well as MAX_LOADED_MODELS weren't included in the environment yesterday.)

export OLLAMA_NUM_PARALLEL=16
export OLLAMA_MAX_LOADED_MODELS=3

export OLLAMA_DEBUG=1

ollama-1.32-DEBUG.log

ollama-1.33-DEBUG.log

Author
Owner

@bsdnet commented on GitHub (May 4, 2024):

Not sure whether the issue comes from timing :)

Enabling debug usually means more logging; more logging usually means the timing changed.

One way to confirm this is to run 1.33 without DEBUG enabled.

Author
Owner

@dhiltgen commented on GitHub (May 4, 2024):

Based on your 0.1.33 log with debug enabled:

It sees all 3 GPUs:

time=2024-05-04T14:05:09.283+02:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama2604474549/runners/cuda_v11/libcudart.so.11.0 count=3
time=2024-05-04T14:05:09.283+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7] CUDA totalMem 34089730048
[GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7] CUDA freeMem 33765720064
[GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7] Compute Capability 7.0
[GPU-5105e575-3fba-efc4-c055-9b7051c99884] CUDA totalMem 34089730048
[GPU-5105e575-3fba-efc4-c055-9b7051c99884] CUDA freeMem 33765720064
[GPU-5105e575-3fba-efc4-c055-9b7051c99884] Compute Capability 7.0
[GPU-d6e916e0-25b7-1a3d-845b-3c4531c447e4] CUDA totalMem 34089730048
[GPU-d6e916e0-25b7-1a3d-845b-3c4531c447e4] CUDA freeMem 33765720064
[GPU-d6e916e0-25b7-1a3d-845b-3c4531c447e4] Compute Capability 7.0

The scheduler determined the requested model could fit in a single GPU for best performance:

time=2024-05-04T14:05:10.668+02:00 level=DEBUG source=sched.go:508 msg="new model will fit in available VRAM in single GPU, loading" model=/home/aljordan/.ollama/models/blobs/sha256-8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 gpu=GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7 available=33765720064 required="5222.6 MiB"

and we can see the backend loaded all the layers:

ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
time=2024-05-04T14:05:10.922+02:00 level=DEBUG source=server.go:466 msg="server not yet available" error="server not responding"
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU

It is possible we have a scheduling race we haven't found/fixed yet, since the scheduler code is brand new. If you manage to repro the failure mode of hitting a single GPU with partial offload, please share the logs so we can see what the scheduler was doing.

Author
Owner

@thevisad commented on GitHub (May 4, 2024):

I had the same issue today; rolling back to 1.31 resolved it. I spent the day in the Discord chatting with users, trying various things without resolution. I was able to raise num_gpu to the required amount, and it will then find and utilize both GPUs.


May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.005Z level=INFO source=images.go:828 msg="total blobs: 30"
May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.005Z level=INFO source=images.go:835 msg="total unused blobs removed: 0"
May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.006Z level=INFO source=routes.go:1071 msg="Listening on [::]:11434 (version 0.1.33)"
May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.006Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1884500785/runners
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.131Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11 rocm_v60002 cpu cpu_avx cpu_avx2]"
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.132Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.278Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1884500785/runners/cuda_v11/libcudart.so.11.0 count=2
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.278Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May  4 15:52:02 prettygirl ollama[31666]: time=2024-05-04T15:52:02.533Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
May  4 15:52:02 prettygirl ollama[31666]: time=2024-05-04T15:52:02.535Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1884500785/runners/cuda_v11/libcudart.so.11.0 count=2
May  4 15:52:02 prettygirl ollama[31666]: time=2024-05-04T15:52:02.535Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.076Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="48402.9 MiB" memory.required.full="5222.6 MiB" memory.required.partial="5222.6 MiB" memory.required.kv="1024.0 MiB" memory.weights.total="3577.6 MiB" memory.weights.repeating="3475.0 MiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="193.0 MiB"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.076Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="48402.9 MiB" memory.required.full="5222.6 MiB" memory.required.partial="5222.6 MiB" memory.required.kv="1024.0 MiB" memory.weights.total="3577.6 MiB" memory.weights.repeating="3475.0 MiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="193.0 MiB"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.076Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.077Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama1884500785/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 43549"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.077Z level=INFO source=sched.go:340 msg="loaded runners" count=1
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.077Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
May  4 15:52:03 prettygirl ollama[31772]: {"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140550867894272","timestamp":1714837923}
May  4 15:52:03 prettygirl ollama[31772]: {"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140550867894272","timestamp":1714837923}
May  4 15:52:03 prettygirl ollama[31772]: {"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":6,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | N                                                               EON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140550867894272","timestamp":1714837923,"total_threads":12}
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   0:                       general.architecture str              = llama
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   1:                               general.name str              = codellama
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  19:               general.quantization_version u32              = 2
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - type  f32:   65 tensors
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - type q4_0:  225 tensors
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - type q6_K:    1 tensors
May  4 15:52:03 prettygirl ollama[31666]: llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 259/32016 ).
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: format           = GGUF V2
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: arch             = llama
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: vocab type       = SPM
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_vocab          = 32016
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_merges         = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_ctx_train      = 16384
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd           = 4096
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_head           = 32
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_head_kv        = 32
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_layer          = 32
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_rot            = 128
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_head_k    = 128
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_head_v    = 128
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_gqa            = 1
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_k_gqa     = 4096
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_v_gqa     = 4096
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_norm_eps       = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_logit_scale    = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_ff             = 11008
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_expert         = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_expert_used    = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: causal attn      = 1
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: pooling type     = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope type        = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope scaling     = linear
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: freq_base_train  = 1000000.0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: freq_scale_train = 1
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_yarn_orig_ctx  = 16384
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope_finetuned   = unknown
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_conv       = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_inner      = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_state      = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_dt_rank      = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model type       = 7B
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model ftype      = Q4_0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model params     = 6.74 B
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: general.name     = codellama
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: BOS token        = 1 '<s>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: EOS token        = 2 '</s>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: UNK token        = 0 '<unk>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: LF token         = 13 '<0x0A>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: PRE token        = 32007 '▁<PRE>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: SUF token        = 32008 '▁<SUF>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: MID token        = 32009 '▁<MID>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: EOT token        = 32010 '▁<EOT>'
May  4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
May  4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
May  4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: found 1 CUDA devices:
May  4 15:52:03 prettygirl ollama[31666]:   Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: ggml ctx size =    0.30 MiB
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloading 32 repeating layers to GPU
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloading non-repeating layers to GPU
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloaded 33/33 layers to GPU
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors:        CPU buffer size =    70.35 MiB
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors:      CUDA0 buffer size =  3577.61 MiB
May  4 15:52:04 prettygirl ollama[31666]: ..................................................................................................
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_ctx      = 2048
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_batch    = 512
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_ubatch   = 512
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: freq_base  = 1000000.0
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: freq_scale = 1
May  4 15:52:04 prettygirl ollama[31666]: llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: graph nodes  = 1030
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: graph splits = 2
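
A minimal sketch of the num_gpu workaround mentioned above: pass an explicit GPU layer count through the options field of /api/generate. This assumes an Ollama server on localhost:11434; the model name and layer count below are illustrative, not taken from this exact setup.

// Illustrative only: request a specific number of GPU layers via API options
// instead of relying on the scheduler's estimate.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	payload, _ := json.Marshal(map[string]any{
		"model":  "codellama", // illustrative model name
		"prompt": "say hi",
		"stream": false,
		"options": map[string]any{
			"num_gpu": 33, // ask for all 33 layers to be offloaded
		},
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(payload))
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}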

Author
Owner

@JieChenSimon commented on GitHub (May 5, 2024):

The same issue occurred to me when upgrading.
[screenshot]

Author
Owner

@dhiltgen commented on GitHub (May 5, 2024):

@thevisad and @JieChenSimon from what I can tell, the system is behaving as expected in your examples. We try NOT to spread a single model over multiple GPUs now, as that actually makes things run slower, not faster, if the model fits within one GPU. We now only spread a model across multiple GPUs if it won't fit in a single GPU. If that's not the behavior you're seeing, can you clarify?
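
For readers trying to follow the new behavior, a rough sketch of the placement decision described here, not the actual scheduler code (the types and field names below are assumptions for illustration):

// Illustrative sketch only: prefer a single GPU when the model's estimated
// VRAM requirement fits on one device, and only spread across GPUs when it
// does not.
package sched

type gpuInfo struct {
	ID        string
	FreeBytes uint64
}

// pickGPUs returns the GPUs a new model load should target.
func pickGPUs(gpus []gpuInfo, requiredBytes uint64) []gpuInfo {
	// Single-GPU placement avoids cross-device traffic, which is why 0.1.33
	// deliberately stops splitting models that fit on one GPU.
	for _, g := range gpus {
		if g.FreeBytes >= requiredBytes {
			return []gpuInfo{g}
		}
	}
	// Too large for any single GPU: fall back to spreading over all of them.
	return gpus
}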

Author
Owner

@wlsoft2006 commented on GitHub (May 6, 2024):

[screenshot]

Author
Owner

@wlsoft2006 commented on GitHub (May 6, 2024):

Only one GPU in use after updating to 1.33.
Linux ai-centos7 3.10.0-1160.114.2.el7.x86_64 #1 SMP Wed Mar 20 15:54:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.9.2009 (Core)

Author
Owner

@wlsoft2006 commented on GitHub (May 6, 2024):

[screenshot]
When I load both models at the same time, it works!
That's all I need, no problem!

Author
Owner

@nyoma-diamond commented on GitHub (May 8, 2024):

EDIT: Turned out to be user error. My system's administrator, for some reason, decided to set the CUDA_VISIBLE_DEVICES environment variable for each user so they could only access one specific GPU (I happened to be set to GPU 1). I thought I had CUDA_VISIBLE_DEVICES unset, but when I checked again in a fresh bash session it was set to the device ID for GPU 1. Unsetting the variable or adding the IDs of other GPUs resolved this.

I'm also running into this problem. The system I am using has 4x Nvidia P100s, but Ollama only sees one at any given moment (from what I can tell, always GPU 1, not 0, 2, or 3). However, I'm observing this behavior on both v0.1.32 and v0.1.34.

time=2024-05-08T14:44:14.522+01:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-08T14:44:14.575+01:00 level=INFO source=gpu.go:127 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.54.15

Output of nvidia-smi (abbreviated):

Wed May  8 14:49:06 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P100-PCIE-16GB           Off |   00000000:25:00.0 Off |                    0 |
| N/A   32C    P0             27W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           Off |   00000000:5B:00.0 Off |                    0 |
| N/A   50C    P0             42W /  250W |    5254MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla P100-PCIE-16GB           Off |   00000000:9B:00.0 Off |                    0 |
| N/A   33C    P0             26W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla P100-PCIE-16GB           Off |   00000000:C8:00.0 Off |                    0 |
| N/A   33C    P0             33W /  250W |     288MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

As a result, large models get partially loaded onto one GPU and any excess is offloaded to CPU instead of using the remaining three GPUs. In logs Ollama says it only detects the one GPU. Occurs both on v0.1.32 and v0.1.34 with or without OLLAMA_DEBUG enabled.

It may be worth noting that the GPU that Ollama detects is always GPU 1 (as listed in nvidia-smi). Since this system is shared across multiple users, this also causes problems when someone is already using the selected GPU, causing Ollama to offload the entire model to the CPU, rather than using any of the other completely free GPUs.

Author
Owner

@dhiltgen commented on GitHub (May 8, 2024):

I'm working on a change that will expose this setting in the logs during startup so it's easier to spot misconfigurations.

What I also noticed is we have a regression in 0.1.34 where CUDA_VISIBLE_DEVICES is no longer filtering out GPUs since we switched from the cuda runtime library to the nvidia driver library in the latest release. I'll look at adding a fix for that in the PR as well.

Update: my test was incorrect, CUDA_VISIBLE_DEVICES is still working properly.
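
A rough sketch of the kind of startup logging described above, not the actual change (the variable list and log wording are assumptions): record the GPU-visibility environment variables in the server log so a per-user CUDA_VISIBLE_DEVICES is easy to spot.

// Illustrative sketch: log GPU-visibility overrides at startup so
// misconfigurations show up early in the server log.
package main

import (
	"log/slog"
	"os"
)

func logGPUVisibility() {
	for _, key := range []string{"CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES"} {
		if v, ok := os.LookupEnv(key); ok {
			slog.Info("GPU visibility override detected", "key", key, "value", v)
		}
	}
}

func main() {
	logGPUVisibility()
}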

Author
Owner

@ToRvaLDz commented on GitHub (May 20, 2024):

I have the same problem in Docker. I have 13 GPUs, but it only finds 1:

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes
NVIDIA_VISIBLE_DEVICES=all
HOSTNAME=502558fb132a
PWD=/
NVIDIA_DRIVER_CAPABILITIES=compute,utility
OLLAMA_MAX_LOADED_MODELS=3
CUDA_VISIBLE_DEVICES=12
OLLAMA_HOST=0.0.0.0
TERM=xterm
SHLVL=1
OLLAMA_NUM_PARALLEL=12
OLLAMA_DEBUG=0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
OLLAMA_KEEP_ALIVE=24h
_=/usr/bin/env
time=2024-05-20T11:50:34.699Z level=INFO source=images.go:704 msg="total blobs: 5"
time=2024-05-20T11:50:34.700Z level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-05-20T11:50:34.701Z level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"
time=2024-05-20T11:50:34.701Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1665127007/runners
time=2024-05-20T11:50:38.352Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]"
time=2024-05-20T11:50:44.849Z level=INFO source=types.go:71 msg="inference compute" id=GPU-4faa64e1-cd46-b533-d16d-c39809fde7ac library=cuda compute=7.5 driver=12.4 name="NVIDIA GeForce GTX 1660 SUPER" total="5.8 GiB" available="5.7 GiB"
[GIN] 2024/05/20 - 11:51:13 | 200 |     572.484µs |      172.18.0.1 | GET      "/api/tags"
[GIN] 2024/05/20 - 11:51:13 | 200 |     346.879µs |      172.18.0.1 | GET      "/api/tags"
[GIN] 2024/05/20 - 11:51:13 | 200 |     274.741µs |      172.18.0.1 | GET      "/api/tags"
[GIN] 2024/05/20 - 11:51:13 | 200 |      29.468µs |      172.18.0.1 | GET      "/api/version"
[GIN] 2024/05/20 - 11:51:20 | 200 |      90.274µs |      172.18.0.1 | GET      "/api/version"
time=2024-05-20T11:51:25.103Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=16 memory.available="5.7 GiB" memory.required.full="9.2 GiB" memory.required.partial="5.6 GiB" memory.required.kv="3.0 GiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB"
time=2024-05-20T11:51:25.104Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=16 memory.available="5.7 GiB" memory.required.full="9.2 GiB" memory.required.partial="5.6 GiB" memory.required.kv="3.0 GiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB"
time=2024-05-20T11:51:25.105Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=16 memory.available="5.7 GiB" memory.required.full="9.2 GiB" memory.required.partial="5.6 GiB" memory.required.kv="3.0 GiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB"
time=2024-05-20T11:51:25.105Z level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama1665127007/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 --ctx-size 24576 --batch-size 512 --embedding --log-disable --n-gpu-layers 16 --parallel 12 --port 45825"
time=2024-05-20T11:51:25.105Z level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-20T11:51:25.105Z level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-20T11:51:25.106Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="952d03d" tid="134968727871488" timestamp=1716205885
INFO [main] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="134968727871488" timestamp=1716205885 total_threads=4
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="14" port="45825" tid="134968727871488" timestamp=1716205885
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-05-20T11:51:25.357Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4437.80 MiB
llm_load_tensors:      CUDA0 buffer size =  1872.50 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 24576
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  1536.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1536.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     6.06 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1705.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    56.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 180

Inside the docker container:

root@502558fb132a:/# nvidia-smi
Mon May 20 11:54:05 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    Off |   00000000:01:00.0 Off |                  N/A |
| 40%   46C    P0             26W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1660 ...    Off |   00000000:04:00.0 Off |                  N/A |
| 37%   43C    P0             30W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce GTX 1660 ...    Off |   00000000:06:00.0 Off |                  N/A |
| 39%   43C    P0             31W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce GTX 1660 ...    Off |   00000000:08:00.0 Off |                  N/A |
| 42%   47C    P0             28W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce GTX 1660 ...    Off |   00000000:09:00.0 Off |                  N/A |
| 41%   44C    P0             31W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0A:00.0 Off |                  N/A |
| 36%   41C    P0             33W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0B:00.0 Off |                  N/A |
| 45%   43C    P0             30W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0C:00.0 Off |                  N/A |
| 43%   45C    P0             31W /  125W |       1MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   8  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0D:00.0 Off |                  N/A |
| 43%   44C    P0             30W /  125W |       1MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   9  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0E:00.0 Off |                  N/A |
| 22%   44C    P0             28W /  125W |       1MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|  10  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0F:00.0 Off |                  N/A |
| 29%   44C    P0             31W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|  11  NVIDIA GeForce GTX 1660 ...    Off |   00000000:10:00.0 Off |                  N/A |
| 20%   43C    P0             32W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|  12  NVIDIA GeForce GTX 1660 ...    Off |   00000000:11:00.0 Off |                  N/A |
| 45%   47C    P2             30W /  125W |    5441MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

@dhiltgen commented on GitHub (May 20, 2024):

@ToRvaLDz CUDA_VISIBLE_DEVICES=12 will only expose one of your GPUs to ollama. If you remove that environment variable, then it should see all the devices. Alternatively, you could set it to CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12

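For reference, a minimal sketch of how the container could be started so ollama sees every card; the volume, port, and container name below are common defaults rather than values taken from this thread:

```
# Expose all GPUs to the container, and either drop CUDA_VISIBLE_DEVICES
# entirely or list every index explicitly; a single index such as "12"
# limits ollama to that one device ("found 1 CUDA devices").
docker run -d --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12 \
  -e OLLAMA_NUM_PARALLEL=12 \
  -e OLLAMA_MAX_LOADED_MODELS=3 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```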

@ToRvaLDz commented on GitHub (May 21, 2024):

> @ToRvaLDz CUDA_VISIBLE_DEVICES=12 will only expose one of your GPUs to ollama. If you remove that environment variable, then it should see all the devices. Alternatively, you could set it to CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12

I'm sorry, you are right. Thank you.


@dhiltgen commented on GitHub (May 21, 2024):

I'm going to mark this one closed now, as the visible devices env var seems to be working properly. I am working on some improvements to the concurrency memory predictions that will help when operating near max VRAM allocation; these should land in an upcoming release.


@techResearcher2021 commented on GitHub (Jun 3, 2024):

It does not work inside the docker container even with the env var CUDA_VISIBLE_DEVICES=0,1 set. I'm using the 0.1.41 docker image with dual RTX 4090s.
Here is part of the log:
time=2024-06-03T11:29:50.606Z level=INFO source=types.go:71 msg="inference compute" id=GPU-70127701-8921-747f-9194-ce6a8699d820 library=cuda compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4090" total="23.6 GiB" available="23.2 GiB"
time=2024-06-03T11:29:50.606Z level=INFO source=types.go:71 msg="inference compute" id=GPU-61837e28-1bfe-a560-ddd2-0a14a55cf642 library=cuda compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4090" total="23.6 GiB" available="23.3 GiB"
time=2024-06-03T11:30:02.386Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=65 memory.available="23.3 GiB" memory.required.full="22.8 GiB" memory.required.partial="22.8 GiB" memory.required.kv="4.0 GiB" memory.weights.total="16.8 GiB" memory.weights.repeating="16.2 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.6 GiB"
time=2024-06-03T11:30:02.388Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=65 memory.available="23.3 GiB" memory.required.full="22.8 GiB" memory.required.partial="22.8 GiB" memory.required.kv="4.0 GiB" memory.weights.total="16.8 GiB" memory.weights.repeating="16.2 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.6 GiB"
time=2024-06-03T11:30:02.388Z level=INFO source=server.go:341 msg="starting llama server" cmd="/tmp/ollama2149316569/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-0688760683b9ca390070d62d06bdba06593d200cf07456478e4baeb66655c64b --ctx-size 16384 --batch-size 512 --embedding --log-disable --n-gpu-layers 65 --flash-attn --parallel 2 --port 45911"
time=2024-06-03T11:30:02.389Z level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-03T11:30:02.389Z level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-03T11:30:02.389Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="5921b8f" tid="140422165536768" timestamp=1717414202
INFO [main] system info | n_threads=40 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140422165536768" timestamp=1717414202 total_threads=80
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="79" port="45911" tid="140422165536768" timestamp=1717414202
llama_model_loader: loaded meta data with 20 key-value pairs and 771 tensors from /root/.ollama/models/blobs/sha256-0688760683b9ca390070d62d06bdba06593d200cf07456478e4baeb66655c64b (version GGUF V3 (latest))
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
time=2024-06-03T11:30:04.096Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server not responding"
time=2024-06-03T11:30:04.801Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"


@dhiltgen commented on GitHub (Jun 4, 2024):

@techResearcher2021 the model you're loading fits in 1 GPU, so it's only using 1. If you tried to load a larger model that needs more VRAM than one of your GPUs, it would use both.

There's a feature enhancement to allow spreading a model across GPUs even when it fits in one GPU, tracked via #4198.

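To confirm the scheduling behaviour described above, one option (assuming the standard docker setup with a container named ollama) is to watch per-GPU memory while the model is loaded:

```
# Per-GPU memory use from inside the container: a model that fits on one
# card allocates on a single index, while a model larger than one GPU's
# VRAM (e.g. a 70B quant) should also spill onto the second 4090.
docker exec -it ollama nvidia-smi --query-gpu=index,memory.used --format=csv
```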

@userbox020 commented on GitHub (Aug 27, 2024):

What's the equivalent for exposing CUDA devices on AMD? I'm having the same problem but with my AMD cards.


@dhiltgen commented on GitHub (Sep 3, 2024):

@userbox020 see https://github.com/ollama/ollama/blob/main/docs/gpu.md#gpu-selection-1

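For AMD cards the selection variable differs from the CUDA one. A sketch along the lines of the linked GPU-selection docs, with example indices (confirm the exact variable and device numbering there, e.g. via rocminfo):

```
# AMD analogue of CUDA_VISIBLE_DEVICES: restrict ollama to specific ROCm devices
ROCR_VISIBLE_DEVICES=0,1 ollama serve
```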

@userbox020 commented on GitHub (Sep 25, 2024):

> @userbox020 see https://github.com/ollama/ollama/blob/main/docs/gpu.md#gpu-selection-1

thanks bro, you're the best!


@accqaz commented on GitHub (Jan 10, 2025):

> @ToRvaLDz CUDA_VISIBLE_DEVICES=12 will only expose one of your GPUs to ollama. If you remove that environment variable, then it should see all the devices. Alternatively, you could set it to CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12

Hello! I run OLLAMA_FLASH_ATTENTION=1 CUDA_VISIBLE_DEVICES=0,1 bin/ollama serve, but I hit the error Error: listen tcp 127.0.0.1:11434: bind: address already in use. Could you please help me solve it? I want to run the qwen2.5-72b model, but it only detects one device (the RTX A6000 card), and it is too slow and often hits runtime errors. How can I speed it up? (No docker, just ollama serve.) Thank you very much!

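The bind error usually means another ollama instance is already listening on port 11434. A possible way to check, assuming a Linux host where ollama was installed as a systemd service (the port and service name here are only illustrative):

```
# See what already owns the default port
sudo lsof -i :11434
# Stop a service-managed instance before starting one by hand...
sudo systemctl stop ollama
# ...or run the manual instance on a different port instead
OLLAMA_HOST=127.0.0.1:11435 OLLAMA_FLASH_ATTENTION=1 CUDA_VISIBLE_DEVICES=0,1 bin/ollama serve
```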
Reference: github-starred/ollama#64610