[GH-ISSUE #4139] only 1 GPU found -- regression 1.32 -> 1.33 #28332


GiteaMirror · 2026-04-22T06:25:50-05:00

GiteaMirror commented

2026-04-22 06:25:50 -05:00

Originally created by @AlexLJordan on GitHub (May 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4139

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Hi everyone,

Sorry, I don't have much time to write, but going from 1.32 to 1.33, this:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
  Device 1: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
  Device 2: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.45 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:      CUDA0 buffer size =  1194.53 MiB
llm_load_tensors:      CUDA1 buffer size =  1194.53 MiB
llm_load_tensors:      CUDA2 buffer size =  1188.49 MiB

changed into this:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 3 repeating layers to GPU
llm_load_tensors: offloaded 3/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3647.87 MiB
llm_load_tensors:      CUDA0 buffer size =   325.78 MiB

1.33 hammers my CPU cores, is generally slower, and doesn't even properly utilize the one GPU it does find.
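For anyone comparing their own logs the same way, here is a minimal sketch that pulls the two numbers that differ between the excerpts (CUDA device count and offloaded layers) out of a log snippet. The function name `summarize_load` and the regular expressions are assumptions written against the lines quoted in this report, not anything shipped with Ollama or llama.cpp.

```python
import re

# Patterns written against the log lines quoted in this issue (assumption:
# other versions may word these lines differently).
DEVICES_RE = re.compile(r"ggml_cuda_init: found (\d+) CUDA devices")
OFFLOAD_RE = re.compile(r"llm_load_tensors: offloaded (\d+)/(\d+) layers to GPU")


def summarize_load(log_text):
    """Return the CUDA device count and layer offload counts found in log_text."""
    devices = DEVICES_RE.search(log_text)
    offload = OFFLOAD_RE.search(log_text)
    return {
        "cuda_devices": int(devices.group(1)) if devices else None,
        "layers_offloaded": int(offload.group(1)) if offload else None,
        "layers_total": int(offload.group(2)) if offload else None,
    }


# Two-line excerpts taken from the logs in this report.
old_log = """ggml_cuda_init: found 3 CUDA devices:
llm_load_tensors: offloaded 33/33 layers to GPU"""
new_log = """ggml_cuda_init: found 1 CUDA devices:
llm_load_tensors: offloaded 3/33 layers to GPU"""

print(summarize_load(old_log))
print(summarize_load(new_log))
```

A partial offload (fewer layers offloaded than the total) is the symptom that matters here, since the remaining layers run on the CPU.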

I need the new concurrency features, so I'd really appreciate it if 1.33 worked on my machine.

Please help.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

1.33


GiteaMirror added the nvidia bug labels 2026-04-22 06:25:50 -05:00

GiteaMirror closed this issue

2026-04-22 06:25:51 -05:00

GiteaMirror commented

2026-04-22 06:25:53 -05:00

@dhiltgen commented on GitHub (May 3, 2024):

Can you share more of the server log, ideally with OLLAMA_DEBUG=1 set, so we can see the early bootstrapping GPU discovery logic?


GiteaMirror commented

2026-04-22 06:25:55 -05:00

@AlexLJordan commented on GitHub (May 3, 2024):

These are logs that I store automatically, so they don't have OLLAMA_DEBUG set. It's late here, so if these logs aren't helpful, I'll need to rerun it with DEBUG tomorrow.

Ollama 1.32
ollama-1.32.log (https://github.com/ollama/ollama/files/15205209/ollama-1.32.log)

Ollama 1.33
ollama-1.33.log (https://github.com/ollama/ollama/files/15205210/ollama-1.33.log)

Thanks for your help!

GiteaMirror commented

2026-04-22 06:25:55 -05:00

@dhiltgen commented on GitHub (May 3, 2024):

From the logs I can see that we did discover all 3 GPUs:

time=2024-05-03T22:22:34.769+02:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama2048242415/runners/cuda_v11/libcudart.so.11.0 count=3

Unfortunately, without debug logging enabled, I can't see why the scheduler decided to run on only a single GPU with only 3 layers. If you can re-run just 0.1.33 with OLLAMA_DEBUG=1 and share the log, that will help us root-cause the defect.
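To check quickly whether a given server log contains the discovery line quoted above, something like the following sketch may help. It treats each log line as space-separated `key=value` pairs with optional double quotes, which matches the lines shown in this thread; the helper names `parse_kv_line` and `detected_gpu_count` are hypothetical.

```python
import shlex


def parse_kv_line(line):
    """Split a key=value log line into a dict; return {} if it cannot be parsed."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Lines with unbalanced quotes (e.g. truncated tokenizer dumps) are skipped.
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def detected_gpu_count(log_text):
    """Return the count from the first 'detected GPUs' line, or None if absent."""
    for line in log_text.splitlines():
        fields = parse_kv_line(line)
        if fields.get("msg") == "detected GPUs" and "count" in fields:
            return int(fields["count"])
    return None


sample = (
    'time=2024-05-03T22:22:34.769+02:00 level=INFO source=gpu.go:101 '
    'msg="detected GPUs" '
    'library=/tmp/ollama2048242415/runners/cuda_v11/libcudart.so.11.0 count=3'
)
print(detected_gpu_count(sample))
```

Running it over both attached logs would show at a glance which of them contains the line and what count it reports.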


GiteaMirror commented

2026-04-22 06:25:56 -05:00

@bsdnet commented on GitHub (May 4, 2024):

@dhiltgen it seems the log line you referred to is from ollama-1.32.log (https://github.com/ollama/ollama/files/15205209/ollama-1.32.log)

GiteaMirror commented

2026-04-22 06:25:57 -05:00

@AlexLJordan commented on GitHub (May 4, 2024):

Hi again!

I was able to rerun the workload with DEBUG enabled on both versions [see below].

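Weirdly enough, Ollama 1.33 uses a full GPU this time:

[Screenshot: Selection_2024-05-04--001 (https://github.com/ollama/ollama/assets/10133257/a2e4441c-2f7b-4451-847e-077d7a5f4f7c)]

It's still much slower than 1.32, though: there, one set of jobs completes in half an hour, whereas 1.33 shows a projected completion time of somewhere between 2 and 3.5 hours.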

<EDIT>
Additional weirdness:
Yesterday's run of 1.33 didn't have that msg="detected GPUs" library=/tmp/ollama2048242415/runners/cuda_v11/libcudart.so.11.0 count=3 line, as bsdnet already pointed out.
But in the attached log the following line showed up:

time=2024-05-04T14:05:09.283+02:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama2604474549/runners/cuda_v11/libcudart.so.11.0 count=3

</EDIT>

I'm relatively sure that the only change from yesterday to today is adding the following to the environment (.env) file that Ollama runs with. (It turns out I had the Ollama environment variables in the wrong file, so NUM_PARALLEL and MAX_LOADED_MODELS weren't included in the environment yesterday.)

export OLLAMA_NUM_PARALLEL=16
export OLLAMA_MAX_LOADED_MODELS=3

export OLLAMA_DEBUG=1
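Since the variables silently ended up in the wrong file the first time, a small check like the sketch below can confirm they are actually present before starting a run. It only inspects the environment of the process it runs in, so it is meaningful only when launched from the same shell or launcher that starts Ollama; the helper name `missing_ollama_vars` is hypothetical.

```python
import os

# The three variables exported above.
EXPECTED_VARS = ("OLLAMA_NUM_PARALLEL", "OLLAMA_MAX_LOADED_MODELS", "OLLAMA_DEBUG")


def missing_ollama_vars(environ=None):
    """Return the expected variable names that are unset or empty in environ."""
    environ = os.environ if environ is None else environ
    return [name for name in EXPECTED_VARS if not environ.get(name)]


missing = missing_ollama_vars()
if missing:
    print("not set in this environment:", ", ".join(missing))
else:
    print("all expected Ollama variables are set")
```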

ollama-1.32-DEBUG.log (https://github.com/ollama/ollama/files/15209662/ollama-1.32-DEBUG.log)

ollama-1.33-DEBUG.log (https://github.com/ollama/ollama/files/15209661/ollama-1.33-DEBUG.log)

GiteaMirror commented

2026-04-22 06:25:57 -05:00

@bsdnet commented on GitHub (May 4, 2024):

Not sure whether the issue comes from timing :)

Enabling debug usually means more logging, and more logging usually changes the timing.

One way to confirm this is to run 1.33 without DEBUG enabled.


GiteaMirror commented

2026-04-22 06:25:58 -05:00

@dhiltgen commented on GitHub (May 4, 2024):

Based on your 0.1.33 log with debug enabled:

It sees all 3 GPUs:

time=2024-05-04T14:05:09.283+02:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama2604474549/runners/cuda_v11/libcudart.so.11.0 count=3
time=2024-05-04T14:05:09.283+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7] CUDA totalMem 34089730048
[GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7] CUDA freeMem 33765720064
[GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7] Compute Capability 7.0
[GPU-5105e575-3fba-efc4-c055-9b7051c99884] CUDA totalMem 34089730048
[GPU-5105e575-3fba-efc4-c055-9b7051c99884] CUDA freeMem 33765720064
[GPU-5105e575-3fba-efc4-c055-9b7051c99884] Compute Capability 7.0
[GPU-d6e916e0-25b7-1a3d-845b-3c4531c447e4] CUDA totalMem 34089730048
[GPU-d6e916e0-25b7-1a3d-845b-3c4531c447e4] CUDA freeMem 33765720064
[GPU-d6e916e0-25b7-1a3d-845b-3c4531c447e4] Compute Capability 7.0

The scheduler determined the requested model could fit in a single GPU for best performance:

time=2024-05-04T14:05:10.668+02:00 level=DEBUG source=sched.go:508 msg="new model will fit in available VRAM in single GPU, loading" model=/home/aljordan/.ollama/models/blobs/sha256-8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 gpu=GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7 available=33765720064 required="5222.6 MiB"
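The arithmetic behind that log line can be sketched as below. This is only an illustration of comparing the logged `required` and `available` values, not Ollama's actual scheduler code; the helper names are hypothetical, and the real decision presumably accounts for more than this single comparison.

```python
MIB = 1024 * 1024


def mib_to_bytes(text):
    """Convert a string such as '5222.6 MiB' to a whole number of bytes."""
    number, unit = text.split()
    if unit != "MiB":
        raise ValueError(f"unexpected unit: {unit}")
    return int(float(number) * MIB)


def fits_on_single_gpu(required, available_bytes):
    """True if the required memory is no larger than one GPU's free memory."""
    return mib_to_bytes(required) <= available_bytes


# Values copied from the sched.go:508 line above.
required = "5222.6 MiB"
available = 33765720064

print(mib_to_bytes(required), "bytes required")
print(fits_on_single_gpu(required, available))
```

With roughly 5.5 GB required and about 33.8 GB free on one V100S, the model fits comfortably on a single card, which is consistent with the backend then reporting only one CUDA device.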

and we can see the backend loaded all the layers:

ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
time=2024-05-04T14:05:10.922+02:00 level=DEBUG source=server.go:466 msg="server not yet available" error="server not responding"
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU

It is possible we have a scheduling race we haven't found/fixed yet since the scheduler code is brand new. If you manage to repro the failure mode of hitting a single GPU with partial offload, share the logs so we can see what the scheduler was doing.


GiteaMirror commented

2026-04-22 06:25:58 -05:00

@thevisad commented on GitHub (May 4, 2024):

I had the same issue today; rolling back to 1.31 resolved it. I spent the day on Discord chatting with other users and trying various things without resolution. I was able to raise num_gpu to the amount required, after which it finds and utilizes both GPUs.
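The comment doesn't say how num_gpu was raised. One possible way is to pass it as a per-request option to the local REST API, sketched below under several assumptions: the default endpoint on port 11434, the `codellama` model that appears in the log, and 33 layers (the total reported in the log). The helper names are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_generate_request(model, prompt, num_gpu, host="http://localhost:11434"):
    """Build a POST request for /api/generate that sets the num_gpu option."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # num_gpu is the number of layers to offload to the GPU(s).
        "options": {"num_gpu": num_gpu},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, num_gpu):
    """Send the request to a running server and return the generated text."""
    request = build_generate_request(model, prompt, num_gpu)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Building the request does not need a running server; calling generate() does.
request = build_generate_request("codellama", "Write hello world in Go.", num_gpu=33)
print(request.full_url)
print(request.data.decode("utf-8"))
```

The same option can also be set in a Modelfile with `PARAMETER num_gpu`, which may be more convenient if every request to that model should use it.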


May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.005Z level=INFO source=images.go:828 msg="total blobs: 30"
May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.005Z level=INFO source=images.go:835 msg="total unused blobs removed: 0"
May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.006Z level=INFO source=routes.go:1071 msg="Listening on [::]:11434 (version 0.1.33)"
May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.006Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1884500785/runners
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.131Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11 rocm_v60002 cpu cpu_avx cpu_avx2]"
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.132Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.278Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1884500785/runners/cuda_v11/libcudart.so.11.0 count=2
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.278Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May  4 15:52:02 prettygirl ollama[31666]: time=2024-05-04T15:52:02.533Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
May  4 15:52:02 prettygirl ollama[31666]: time=2024-05-04T15:52:02.535Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1884500785/runners/cuda_v11/libcudart.so.11.0 count=2
May  4 15:52:02 prettygirl ollama[31666]: time=2024-05-04T15:52:02.535Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.076Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="48402.9 MiB" memory.required.full="5222.6 MiB" memory.required.partial="5222.6 MiB" memory.required.kv="1024.0 MiB" memory.weights.total="3577.6 MiB" memory.weights.repeating="3475.0 MiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="193.0 MiB"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.076Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="48402.9 MiB" memory.required.full="5222.6 MiB" memory.required.partial="5222.6 MiB" memory.required.kv="1024.0 MiB" memory.weights.total="3577.6 MiB" memory.weights.repeating="3475.0 MiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="193.0 MiB"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.076Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.077Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama1884500785/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 43549"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.077Z level=INFO source=sched.go:340 msg="loaded runners" count=1
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.077Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
May  4 15:52:03 prettygirl ollama[31772]: {"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140550867894272","timestamp":1714837923}
May  4 15:52:03 prettygirl ollama[31772]: {"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140550867894272","timestamp":1714837923}
May  4 15:52:03 prettygirl ollama[31772]: {"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":6,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140550867894272","timestamp":1714837923,"total_threads":12}
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   0:                       general.architecture str              = llama
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   1:                               general.name str              = codellama
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  19:               general.quantization_version u32              = 2
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - type  f32:   65 tensors
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - type q4_0:  225 tensors
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - type q6_K:    1 tensors
May  4 15:52:03 prettygirl ollama[31666]: llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 259/32016 ).
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: format           = GGUF V2
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: arch             = llama
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: vocab type       = SPM
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_vocab          = 32016
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_merges         = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_ctx_train      = 16384
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd           = 4096
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_head           = 32
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_head_kv        = 32
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_layer          = 32
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_rot            = 128
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_head_k    = 128
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_head_v    = 128
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_gqa            = 1
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_k_gqa     = 4096
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_v_gqa     = 4096
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_norm_eps       = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_logit_scale    = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_ff             = 11008
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_expert         = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_expert_used    = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: causal attn      = 1
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: pooling type     = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope type        = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope scaling     = linear
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: freq_base_train  = 1000000.0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: freq_scale_train = 1
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_yarn_orig_ctx  = 16384
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope_finetuned   = unknown
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_conv       = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_inner      = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_state      = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_dt_rank      = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model type       = 7B
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model ftype      = Q4_0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model params     = 6.74 B
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: general.name     = codellama
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: BOS token        = 1 '<s>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: EOS token        = 2 '</s>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: UNK token        = 0 '<unk>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: LF token         = 13 '<0x0A>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: PRE token        = 32007 '▁<PRE>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: SUF token        = 32008 '▁<SUF>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: MID token        = 32009 '▁<MID>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: EOT token        = 32010 '▁<EOT>'
May  4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
May  4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
May  4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: found 1 CUDA devices:
May  4 15:52:03 prettygirl ollama[31666]:   Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: ggml ctx size =    0.30 MiB
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloading 32 repeating layers to GPU
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloading non-repeating layers to GPU
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloaded 33/33 layers to GPU
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors:        CPU buffer size =    70.35 MiB
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors:      CUDA0 buffer size =  3577.61 MiB
May  4 15:52:04 prettygirl ollama[31666]: ..................................................................................................
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_ctx      = 2048
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_batch    = 512
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_ubatch   = 512
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: freq_base  = 1000000.0
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: freq_scale = 1
May  4 15:52:04 prettygirl ollama[31666]: llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: graph nodes  = 1030
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: graph splits = 2

May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: format = GGUF V2 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: arch = llama May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: vocab type = SPM May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_vocab = 32016 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_merges = 0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_ctx_train = 16384 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd = 4096 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_head = 32 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_head_kv = 32 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_layer = 32 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_rot = 128 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_head_k = 128 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_head_v = 128 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_gqa = 1 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_k_gqa = 4096 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_v_gqa = 4096 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_norm_eps = 0.0e+00 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_logit_scale = 0.0e+00 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_ff = 11008 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_expert = 0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_expert_used = 0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: causal attn = 1 May 4 15:52:03 
prettygirl ollama[31666]: llm_load_print_meta: pooling type = 0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope type = 0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope scaling = linear May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: freq_base_train = 1000000.0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: freq_scale_train = 1 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_yarn_orig_ctx = 16384 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope_finetuned = unknown May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_conv = 0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_inner = 0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_state = 0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_dt_rank = 0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model type = 7B May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model ftype = Q4_0 May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model params = 6.74 B May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model size = 3.56 GiB (4.54 BPW) May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: general.name = codellama May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: BOS token = 1 '<s>' May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: EOS token = 2 '</s>' May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: UNK token = 0 '<unk>' May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: LF token = 13 '<0x0A>' May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: PRE token = 32007 '▁<PRE>' May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: SUF token = 32008 '▁<SUF>' May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: MID token = 32009 '▁<MID>' May 4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: EOT token = 32010 '▁<EOT>' May 
4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes May 4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no May 4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: found 1 CUDA devices: May 4 15:52:03 prettygirl ollama[31666]: Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes May 4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: ggml ctx size = 0.30 MiB May 4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloading 32 repeating layers to GPU May 4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloading non-repeating layers to GPU May 4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloaded 33/33 layers to GPU May 4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: CPU buffer size = 70.35 MiB May 4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: CUDA0 buffer size = 3577.61 MiB May 4 15:52:04 prettygirl ollama[31666]: .................................................................................................. 
May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_ctx = 2048 May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_batch = 512 May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_ubatch = 512 May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: freq_base = 1000000.0 May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: freq_scale = 1 May 4 15:52:04 prettygirl ollama[31666]: llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: CUDA_Host output buffer size = 0.14 MiB May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: CUDA0 compute buffer size = 164.00 MiB May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: graph nodes = 1030 May 4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: graph splits = 2 ```

GiteaMirror commented

2026-04-22 06:25:59 -05:00

@JieChenSimon commented on GitHub (May 5, 2024):

The same issue occurred for me when upgrading.


GiteaMirror commented

2026-04-22 06:26:00 -05:00

@dhiltgen commented on GitHub (May 5, 2024):

@thevisad and @JieChenSimon, from what I can tell, the system is behaving as expected in your examples. We now try NOT to spread a single model over multiple GPUs, because that actually makes things run slower, not faster, when the model could fit within one GPU. We only spread a model across multiple GPUs if it won't fit in a single GPU. If that's not the behavior you're seeing, can you clarify?
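
The placement rule described above can be sketched roughly as follows. This is only an illustration of the stated policy, not Ollama's actual scheduler code: the function name `choose_gpus`, its parameters, and the memory figures in the usage note are all hypothetical.

```python
from typing import List


def choose_gpus(required_mib: float, free_mib: List[float]) -> List[int]:
    """Return the indices of the GPUs a model would be loaded on.

    Hypothetical helper: it only mirrors the rule "use one GPU if the
    model fits, otherwise spread over several".
    """
    # GPUs that can hold the whole model on their own.
    candidates = [i for i, free in enumerate(free_mib) if free >= required_mib]
    if candidates:
        # Prefer a single GPU; pick the one with the most free memory.
        best = max(candidates, key=lambda i: free_mib[i])
        return [best]
    # The model does not fit on any single GPU, so spread it over all of them.
    return list(range(len(free_mib)))
```

With illustrative numbers, `choose_gpus(5000, [24000, 24000])` selects a single GPU, while `choose_gpus(40000, [24000, 24000])` selects both. Under this rule, seeing "found 1 CUDA devices" for a small model would be expected rather than a sign that the other GPUs were not detected.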


GiteaMirror commented

2026-04-22 06:26:01 -05:00

@wlsoft2006 commented on GitHub (May 6, 2024):

@wlsoft2006 commented on GitHub (May 6, 2024): ![image](https://github.com/ollama/ollama/assets/4631931/7e36ccb3-956f-47ed-9af6-8363ee04df99)

GiteaMirror commented

2026-04-22 06:26:02 -05:00

@wlsoft2006 commented on GitHub (May 6, 2024):

Only one GPU is in use after updating to 1.33.
Linux ai-centos7 3.10.0-1160.114.2.el7.x86_64 #1 SMP Wed Mar 20 15:54:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.9.2009 (Core)


GiteaMirror commented

2026-04-22 06:26:03 -05:00

@wlsoft2006 commented on GitHub (May 6, 2024):

When I load both models at the same time, it works!
That's all I need. No problem!


GiteaMirror commented

2026-04-22 06:26:04 -05:00

@nyoma-diamond commented on GitHub (May 8, 2024):

EDIT: Turned out to be user error. My system's administrator for some reason decided to set the CUDA_VISIBLE_DEVICES environment variable for each user so they could only access one specific GPU (I happened to be specifically set to GPU 1). I thought I had CUDA_VISIBLE_DEVICES unset but when I checked again on a fresh bash session it was set to the device ID for GPU 1. Unsetting the variable or adding the IDs of other GPUs resolved this.

I'm also running into this problem. The system I am using has 4x Nvidia P100s, but Ollama only sees one at any given moment (from what I can tell, always GPU 1, not 0, 2, or 3). However, I'm observing this behavior on both v0.1.32 and v0.1.34.

time=2024-05-08T14:44:14.522+01:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-08T14:44:14.575+01:00 level=INFO source=gpu.go:127 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.54.15

Output of nvidia-smi (abbreviated):

Wed May  8 14:49:06 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P100-PCIE-16GB           Off |   00000000:25:00.0 Off |                    0 |
| N/A   32C    P0             27W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           Off |   00000000:5B:00.0 Off |                    0 |
| N/A   50C    P0             42W /  250W |    5254MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla P100-PCIE-16GB           Off |   00000000:9B:00.0 Off |                    0 |
| N/A   33C    P0             26W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla P100-PCIE-16GB           Off |   00000000:C8:00.0 Off |                    0 |
| N/A   33C    P0             33W /  250W |     288MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

As a result, large models get partially loaded onto one GPU and any excess is offloaded to CPU instead of using the remaining three GPUs. In logs Ollama says it only detects the one GPU. Occurs both on v0.1.32 and v0.1.34 with or without OLLAMA_DEBUG enabled.

It may be worth noting that the GPU that Ollama detects is always GPU 1 (as listed in nvidia-smi). Since this system is shared across multiple users, this also causes problems when someone is already using the selected GPU, causing Ollama to offload the entire model to the CPU, rather than using any of the other completely free GPUs.
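
For anyone hitting the same misconfiguration as in the edit above, a quick check of `CUDA_VISIBLE_DEVICES` in the environment that actually launches Ollama may save some time. The following is a minimal sketch; the helper name `visible_devices` is hypothetical and is not part of Ollama or CUDA.

```python
import os
from typing import List, Mapping, Optional


def visible_devices(env: Mapping[str, str] = os.environ) -> Optional[List[str]]:
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, meaning no restriction
    is applied through this variable.
    """
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # The value is a comma-separated list of device indices or UUIDs.
    return [d.strip() for d in raw.split(",") if d.strip()]


if __name__ == "__main__":
    devices = visible_devices()
    if devices is None:
        print("CUDA_VISIBLE_DEVICES is unset")
    else:
        print(f"CUDA_VISIBLE_DEVICES restricts CUDA to: {devices}")
```

Running this from a fresh shell (or from the service's environment) shows whether a per-user default such as a single device ID is being applied.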


GiteaMirror commented

2026-04-22 06:26:05 -05:00

@dhiltgen commented on GitHub (May 8, 2024):

I'm working on a change that will expose this setting in the logs during startup so it's easier to spot misconfigurations.

What I also noticed is we have a regression in 0.1.34 where CUDA_VISIBLE_DEVICES is no longer filtering out GPUs since we switched from the cuda runtime library to the nvidia driver library in the latest release. I'll look at adding a fix for that in the PR as well.

Update: my test was incorrect, CUDA_VISIBLE_DEVICES is still working properly.


GiteaMirror commented

2026-04-22 06:26:08 -05:00

@ToRvaLDz commented on GitHub (May 20, 2024):

I have the same problem in Docker: I have 13 GPUs, but it only finds 1:

ggml_cuda_init: found 1 CUDA devices:  Device 0: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes

NVIDIA_VISIBLE_DEVICES=all
HOSTNAME=502558fb132a
PWD=/
NVIDIA_DRIVER_CAPABILITIES=compute,utility
OLLAMA_MAX_LOADED_MODELS=3
CUDA_VISIBLE_DEVICES=12
OLLAMA_HOST=0.0.0.0
TERM=xterm
SHLVL=1
OLLAMA_NUM_PARALLEL=12
OLLAMA_DEBUG=0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
OLLAMA_KEEP_ALIVE=24h
_=/usr/bin/env

time=2024-05-20T11:50:34.699Z level=INFO source=images.go:704 msg="total blobs: 5"
time=2024-05-20T11:50:34.700Z level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-05-20T11:50:34.701Z level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"
time=2024-05-20T11:50:34.701Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1665127007/runners
time=2024-05-20T11:50:38.352Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]"
time=2024-05-20T11:50:44.849Z level=INFO source=types.go:71 msg="inference compute" id=GPU-4faa64e1-cd46-b533-d16d-c39809fde7ac library=cuda compute=7.5 driver=12.4 name="NVIDIA GeForce GTX 1660 SUPER" total="5.8 GiB" available="5.7 GiB"
[GIN] 2024/05/20 - 11:51:13 | 200 |     572.484µs |      172.18.0.1 | GET      "/api/tags"
[GIN] 2024/05/20 - 11:51:13 | 200 |     346.879µs |      172.18.0.1 | GET      "/api/tags"
[GIN] 2024/05/20 - 11:51:13 | 200 |     274.741µs |      172.18.0.1 | GET      "/api/tags"
[GIN] 2024/05/20 - 11:51:13 | 200 |      29.468µs |      172.18.0.1 | GET      "/api/version"
[GIN] 2024/05/20 - 11:51:20 | 200 |      90.274µs |      172.18.0.1 | GET      "/api/version"
time=2024-05-20T11:51:25.103Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=16 memory.available="5.7 GiB" memory.required.full="9.2 GiB" memory.required.partial="5.6 GiB" memory.required.kv="3.0 GiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB"
time=2024-05-20T11:51:25.104Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=16 memory.available="5.7 GiB" memory.required.full="9.2 GiB" memory.required.partial="5.6 GiB" memory.required.kv="3.0 GiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB"
time=2024-05-20T11:51:25.105Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=16 memory.available="5.7 GiB" memory.required.full="9.2 GiB" memory.required.partial="5.6 GiB" memory.required.kv="3.0 GiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB"
time=2024-05-20T11:51:25.105Z level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama1665127007/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 --ctx-size 24576 --batch-size 512 --embedding --log-disable --n-gpu-layers 16 --parallel 12 --port 45825"
time=2024-05-20T11:51:25.105Z level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-20T11:51:25.105Z level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-20T11:51:25.106Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="952d03d" tid="134968727871488" timestamp=1716205885
INFO [main] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="134968727871488" timestamp=1716205885 total_threads=4
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="14" port="45825" tid="134968727871488" timestamp=1716205885
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-05-20T11:51:25.357Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4437.80 MiB
llm_load_tensors:      CUDA0 buffer size =  1872.50 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 24576
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  1536.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1536.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     6.06 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1705.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    56.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 180

Inside the Docker container (output of `nvidia-smi`):

Mon May 20 11:54:05 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    Off |   00000000:01:00.0 Off |                  N/A |
| 40%   46C    P0             26W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1660 ...    Off |   00000000:04:00.0 Off |                  N/A |
| 37%   43C    P0             30W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce GTX 1660 ...    Off |   00000000:06:00.0 Off |                  N/A |
| 39%   43C    P0             31W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce GTX 1660 ...    Off |   00000000:08:00.0 Off |                  N/A |
| 42%   47C    P0             28W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce GTX 1660 ...    Off |   00000000:09:00.0 Off |                  N/A |
| 41%   44C    P0             31W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0A:00.0 Off |                  N/A |
| 36%   41C    P0             33W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0B:00.0 Off |                  N/A |
| 45%   43C    P0             30W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0C:00.0 Off |                  N/A |
| 43%   45C    P0             31W /  125W |       1MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   8  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0D:00.0 Off |                  N/A |
| 43%   44C    P0             30W /  125W |       1MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   9  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0E:00.0 Off |                  N/A |
| 22%   44C    P0             28W /  125W |       1MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|  10  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0F:00.0 Off |                  N/A |
| 29%   44C    P0             31W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|  11  NVIDIA GeForce GTX 1660 ...    Off |   00000000:10:00.0 Off |                  N/A |
| 20%   43C    P0             32W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|  12  NVIDIA GeForce GTX 1660 ...    Off |   00000000:11:00.0 Off |                  N/A |
| 45%   47C    P2             30W /  125W |    5441MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

GiteaMirror commented

2026-04-22 06:26:08 -05:00

@dhiltgen commented on GitHub (May 20, 2024):

@ToRvaLDz CUDA_VISIBLE_DEVICES=12 will only expose one of your GPUs to ollama. If you remove that environment variable, then it should see all the devices. Alternatively, you could set it to CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12
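
To illustrate the point above, here is a minimal Python sketch of how a `CUDA_VISIBLE_DEVICES` value maps to the GPUs a process can see. The function name `visible_cuda_devices` is made up for this example, and it only models integer indices (the CUDA runtime also accepts GPU UUIDs). It is not ollama or CUDA code, just a model of the documented behaviour.

```python
def visible_cuda_devices(value, total):
    """Return the physical GPU indices a CUDA process would see.

    value: contents of CUDA_VISIBLE_DEVICES, or None if the variable is unset.
    total: number of GPUs physically present.
    Assumption: only integer indices are modelled (no GPU UUIDs).
    """
    if value is None:
        # Variable unset: every GPU is visible.
        return list(range(total))
    visible = []
    for token in value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= total:
            # An invalid entry hides itself and everything listed after it.
            break
        visible.append(int(token))
    return visible


if __name__ == "__main__":
    print(visible_cuda_devices("12", 13))   # only the GPU with index 12
    print(visible_cuda_devices(None, 13))   # all 13 GPUs
```

With 13 GPUs, `visible_cuda_devices("12", 13)` gives `[12]`, and CUDA renumbers that single card as device 0 inside the process. That would be consistent with the log reporting only `Device 0` while `nvidia-smi` shows memory in use on GPU 12.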

GiteaMirror commented

2026-04-22 06:26:09 -05:00

@ToRvaLDz commented on GitHub (May 21, 2024):

> @ToRvaLDz CUDA_VISIBLE_DEVICES=12 will only expose one of your GPUs to ollama. If you remove that environment variable, then it should see all the devices. Alternatively, you could set it to CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12

I'm sorry, you are right. Thank you.

GiteaMirror commented

2026-04-22 06:26:10 -05:00

@dhiltgen commented on GitHub (May 21, 2024):

I'm going to mark this one closed now as the visible devices env var seems to be working properly. I am working on some improvements in concurrency memory predictions that help when operating at near max vram allocation, which should land in an upcoming release.
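
For context on why that log runs close to the VRAM limit, the KV cache size it reports can be reproduced from the printed model parameters. The sketch below is a back-of-the-envelope calculation, not ollama's memory predictor. The helper name `kv_cache_mib` is hypothetical, and the inputs are taken from the log earlier in this thread (`n_layer = 32`, `n_embd_k_gqa = n_embd_v_gqa = 1024`, `n_ctx = 24576`, f16 cache at 2 bytes per element).

```python
def kv_cache_mib(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate K and V cache sizes in MiB for an f16 KV cache.

    Each layer stores one K row and one V row per context position;
    bytes_per_elem=2 corresponds to f16.
    """
    k_bytes = n_layer * n_ctx * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_layer * n_ctx * n_embd_v_gqa * bytes_per_elem
    mib = 2 ** 20
    return k_bytes / mib, v_bytes / mib


if __name__ == "__main__":
    k, v = kv_cache_mib(n_layer=32, n_ctx=24576, n_embd_k_gqa=1024, n_embd_v_gqa=1024)
    print(f"K: {k:.2f} MiB, V: {v:.2f} MiB, total: {k + v:.2f} MiB")
```

This gives 1536 MiB each for K and V, 3072 MiB in total, matching the `KV self size` line in the log. The `n_ctx` of 24576 is 12 × 2048, which lines up with `OLLAMA_NUM_PARALLEL=12` in the posted environment; assuming the context is scaled by the parallel setting, that setting accounts for most of the 3.0 GiB KV requirement on a 6 GiB card.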

GiteaMirror commented

2026-04-22 06:26:11 -05:00

@techResearcher2021 commented on GitHub (Jun 3, 2024):

It does not work inside a Docker container even when exposing the env var CUDA_VISIBLE_DEVICES=0,1. I use the 0.1.41 Docker image with dual RTX 4090s.
Here is part of the logs:
time=2024-06-03T11:29:50.606Z level=INFO source=types.go:71 msg="inference compute" id=GPU-70127701-8921-747f-9194-ce6a8699d820 library=cuda compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4090" total="23.6 GiB" available="23.2 GiB"
time=2024-06-03T11:29:50.606Z level=INFO source=types.go:71 msg="inference compute" id=GPU-61837e28-1bfe-a560-ddd2-0a14a55cf642 library=cuda compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4090" total="23.6 GiB" available="23.3 GiB"
time=2024-06-03T11:30:02.386Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=65 memory.available="23.3 GiB" memory.required.full="22.8 GiB" memory.required.partial="22.8 GiB" memory.required.kv="4.0 GiB" memory.weights.total="16.8 GiB" memory.weights.repeating="16.2 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.6 GiB"
time=2024-06-03T11:30:02.388Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=65 memory.available="23.3 GiB" memory.required.full="22.8 GiB" memory.required.partial="22.8 GiB" memory.required.kv="4.0 GiB" memory.weights.total="16.8 GiB" memory.weights.repeating="16.2 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.6 GiB"
time=2024-06-03T11:30:02.388Z level=INFO source=server.go:341 msg="starting llama server" cmd="/tmp/ollama2149316569/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-0688760683b9ca390070d62d06bdba06593d200cf07456478e4baeb66655c64b --ctx-size 16384 --batch-size 512 --embedding --log-disable --n-gpu-layers 65 --flash-attn --parallel 2 --port 45911"
time=2024-06-03T11:30:02.389Z level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-03T11:30:02.389Z level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-03T11:30:02.389Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="5921b8f" tid="140422165536768" timestamp=1717414202
INFO [main] system info | n_threads=40 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140422165536768" timestamp=1717414202 total_threads=80
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="79" port="45911" tid="140422165536768" timestamp=1717414202
llama_model_loader: loaded meta data with 20 key-value pairs and 771 tensors from /root/.ollama/models/blobs/sha256-0688760683b9ca390070d62d06bdba06593d200cf07456478e4baeb66655c64b (version GGUF V3 (latest))
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
time=2024-06-03T11:30:04.096Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server not responding"
time=2024-06-03T11:30:04.801Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"

GiteaMirror commented

2026-04-22 06:26:12 -05:00

@dhiltgen commented on GitHub (Jun 4, 2024):

@techResearcher2021 the model you're loading fits in 1 GPU, so it's only using 1. If you tried to load a larger model that needs more VRAM than one of your GPUs, it would use both.

There's a feature enhancement to allow spreading across GPUs even when the model fits in one GPU, tracked via #4198
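
The behaviour described above can be sketched as a simple placement rule: use one GPU when the whole model fits on it, otherwise spread. This is only an illustration under that assumption, not ollama's scheduler. `plan_placement` is a hypothetical name, and the numbers come from the log in the previous comment (`memory.required.full="22.8 GiB"`, with 23.2 GiB and 23.3 GiB available).

```python
def plan_placement(required_gib, free_gib_per_gpu):
    """Return the GPU indices a model would be loaded on.

    Hypothetical rule: prefer a single GPU if the model fits entirely,
    otherwise spread over all GPUs if their combined free memory suffices.
    An empty list means the model does not fully fit on the GPUs.
    """
    if not free_gib_per_gpu:
        return []
    # Pick the GPU with the most free memory as the single-GPU candidate.
    best = max(range(len(free_gib_per_gpu)), key=lambda i: free_gib_per_gpu[i])
    if required_gib <= free_gib_per_gpu[best]:
        return [best]
    if required_gib <= sum(free_gib_per_gpu):
        return list(range(len(free_gib_per_gpu)))
    return []


if __name__ == "__main__":
    print(plan_placement(22.8, [23.2, 23.3]))  # fits on one card
    print(plan_placement(40.0, [23.2, 23.3]))  # needs both cards
```

With the logged values the model fits on a single RTX 4090, so only one device is initialised even though both were detected in the `inference compute` lines.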

GiteaMirror commented

2026-04-22 06:26:13 -05:00

@userbox020 commented on GitHub (Aug 27, 2024):

What's the equivalent for exposing CUDA devices on AMD? I'm having the same problem, but with my AMD cards.

GiteaMirror commented

2026-04-22 06:26:13 -05:00

@dhiltgen commented on GitHub (Sep 3, 2024):

@userbox020 see https://github.com/ollama/ollama/blob/main/docs/gpu.md#gpu-selection-1

GiteaMirror commented

2026-04-22 06:26:14 -05:00

@userbox020 commented on GitHub (Sep 25, 2024):

> @userbox020 see https://github.com/ollama/ollama/blob/main/docs/gpu.md#gpu-selection-1

thanks bro, you the best!


GiteaMirror commented

2026-04-22 06:26:15 -05:00

@accqaz commented on GitHub (Jan 10, 2025):

> @ToRvaLDz `CUDA_VISIBLE_DEVICES=12` will only expose one of your GPUs to ollama. If you remove that environment variable, then it should see all the devices. Alternatively, you could set it to `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12`

Hello! I ran `OLLAMA_FLASH_ATTENTION=1 CUDA_VISIBLE_DEVICES=0,1 bin/ollama serve`, but I got the error `Error: listen tcp 127.0.0.1:11434: bind: address already in use`. Could you please help me solve it? I want to run the qwen2.5-72b model, but Ollama only detects one device (the RTX A6000 card), so it is too slow and often hits runtime errors. How can I speed it up? (No Docker, just `ollama serve`.) Thank you very much!
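
The `bind: address already in use` error means some process is already listening on 127.0.0.1:11434. Most likely that is another Ollama server already running, in which case environment variables passed to a second `ollama serve` never reach the instance that is actually serving requests. A minimal sketch for checking this before starting the server, assuming the address and port from the error message (the helper name `port_in_use` is hypothetical):

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 11434) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use():
        print("Port 11434 is taken: stop the running server before starting "
              "a new one with different environment variables.")
    else:
        print("Port 11434 is free.")
```

If the port is taken, the already running server would need to be stopped, or restarted with the desired variables, before the new settings can take effect.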



Reference: github-starred/ollama#28332